A SPECIAL REVIEW OF 
The American Soldier, Vol. IV* 


PHILIP J. MCCARTHY 
CORNELL UNIVERSITY 


Volume IV of The American Soldier presents the first complete 
account of the scale analysis approach of Guttman and the latent 
structure approach of Lazarsfeld to the problem of attitude meas- 
urement. This review has been prepared for the purpose of provid- 
ing an expository account of the models proposed by Guttman and 
Lazarsfeld, together with an indication of the places which call for 
additional clarification and research. 


1. Introduction 


This paper will present an expository review of Volume IV of 
The American Soldier. The materials in this volume are divided into 
two more or less distinct parts. The first eleven chapters deal with 
theoretical and empirical analyses of problems of measurement—in 
particular, the scale analysis approach of Louis Guttman and the lat- 
ent structure approach of Paul Lazarsfeld. The last five chapters are 
devoted to an account of two specific studies in prediction, namely, the 
screening of psychoneurotics in the army and the postwar plans of 
soldiers. This review will be focussed upon the contents of the first 
eleven chapters. 

Scale analysis and latent structure analysis are of fundamental 
importance because they provide a conceptual framework (or, a mod- 
el) with which to attack the problem of attitude measurement. In 
particular, they attempt to test the hypothesis that a delimited area 
of human behavior (which may refer to observed reactions to specific 
situations, or to verbal expressions of feeling toward specific situa- 
tions, and so on) contains only a single dimension (the hypothesis of 
unidimensionality). If the hypothesis of unidimensionality is ac- 
cepted, then one can consider the task of arranging people in rank 
order with respect to this single dimension. This does not necessarily 
mean that ordering cannot be carried out if the hypothesis of unidimensionality 
is rejected, only that the ordering is more meaningful if a single dimension 
is present. Volume IV of The American Soldier gives the first complete account 
of the work of Guttman and Lazarsfeld. This work will not solve all problems 
of attitude measurement. However, it will permit other research workers to 
examine the models which have been set up, to test them out empirically in a 
wide range of situations, and to change or discard them as empirical 
evidence grows. 

*This review was originally prepared at the request of the Sociological 
Research Association, and was presented at the Association’s Annual Meeting 
in New York City, December, 1949. 

Whenever one sets up a model or theoretical framework for de- 
scribing a particular phenomenon, the following questions arise more 
or less naturally: 


(1) Does the model at least provide a logical description of the 
phenomenon under study? 

(2) Can the model be subjected to objective verification? 

(3) When the model holds, does it permit one to make important 
and useful deductions about the phenomenon? 

(4) Does the model fit a wide enough range of cases to make it 
practically useful? 


It is with respect to these four questions that scale and latent struc- 
ture analysis will now be examined. 


2. The Model for Scale Analysis 


(a) The approach. Guttman advocates considering an attitude 
as a delimited totality of behavior with respect to something. He does 
not take this as a complete definition, but only as a necessary compo- 
nent in a definition. Moreover, it is a component which can, hope- 
fully, be given an operational meaning. Once this operational mean- 
ing has been discovered and observed, then one can attempt to weave 
the results back into a social-psychological definition of an attitude. 

The notion of an attitude as a totality of behavior toward something 
requires that the population of individuals whose behavior is to 
be observed is specified. The moment in time at which the observed 
behavior occurs is a necessary part of the population definition. An- 
other moment in time may show a different pattern of behavior. In 
this section it will be assumed that the behavior of the entire popu- 
lation is observed. In practice, it will usually be necessary to study 
only a sample from the population and this aspect of the problem will 
be touched upon in the next section. 

Having defined the population of individuals with respect to 
which the attitude is to be defined, the next implication of the above 
definition is that the behavior of each population element toward 
“something” must be observed. What is this “something” to be? Actually, 
this can be interpreted very broadly. Each individual may be 
placed in a series of concrete situations, and his reactions to each 
situation may be observed; or each individual may be asked a series 
of questions, and his answer to each question may be recorded. This 
latter case is the one which ordinarily occurs in attitude research, 
and it is the only one in which scale analysis has thus far been ap- 
plied to any appreciable extent. Narrowing the range of “something” 
to the asking of questions then leaves the problem—what questions? 

Although it has not been explicitly brought out in the preceding 
paragraphs, one is usually in the position of being able to state, at 
least in broad terms, the general nature of the attitude under inves- 
tigation. It may be attitude toward Russia, attitude toward the United 
Nations, attitude toward socialized medicine, or attitude toward low- 
cost housing. However, whatever the specific topic may be, there will 
always exist (at least conceptually) an indefinitely large number of 
questions which might be asked. The aggregate of all these questions 
is termed by Guttman the universe of content, or the universe of 
attributes. Here, just as for individuals, the universe must be sampled 
in order to carry out a practical study; and again the sampling as- 
pects will be relegated to the later portions of the review. For the 
moment, the existence of this universe of content will be taken for 
granted. 

Ideally, the answer of each person in the population to each ques- 
tion in the universe of content is obtained and recorded. These re- 
sponses can then be examined as they relate to the following two fun- 
damental problems of attitude measurement: 


(1) Does the pattern give evidence concerning the hypothesis of 
unidimensionality? 

(2) If only one dimension is present, can people be given a 
unique rank order with respect to favorableness or unfa- 
vorableness on the attitude? 


Guttman proposes one particular pattern of responses—namely, the 
scale pattern—for which it seems to be possible to give an unequivo- 
cal answer to these questions. 

(b) The model or scale pattern. Assume that the universe of 
content is made up of questions, each of which consists of a series of 
categories graded from favorable to unfavorable (or vice versa). It 
is not necessary that the number of categories be limited to two (i.e., 
a dichotomy). A person responds to a question by placing himself in 
the category which most nearly mirrors his position. Thus in study- 
ing the attitude of enlisted men toward their officers, such questions 
as the following were asked: 

“How do you feel about the privileges that officers get compared 
with those which enlisted men get? 

1. Officers have far too many privileges 
2. Officers have a few too many privileges 
3. Officers have about the right number of privileges 
4. Officers have too few privileges” 


Under these circumstances, it seems logical to state that one indi- 
vidual from the population will have a higher ranking in respect to 
the attitude than another individual if the first is just as high or 
higher on every question or item in the universe of content than is the 
second. As a matter of fact, this is the definition of a scale. To quote: 
“We shall call a set of items of common content a scale if a person 
with a higher rank than another person is just as high or higher on 
every item than the other person.” 

The above definition of a scale immediately leads to the parallelo- 
gram response pattern. Consider a universe of three questions A, B, 
and C, each of which has three categories graded from favorable to 
unfavorable (A1, A2, A3; B1, B2, B3; C1, C2, C3). It is assumed that 
each person in the population places himself in one category for each 
of the three questions. If a scale exists in the sense of the preceding 
definition, there are only seven possible response patterns. These are 
(assuming the smallest proportion endorse A1, the next smallest B1, 
and the next smallest C1): 


Rank of                       Categories 
Individual   A1   B1   C1   A2   B2   C2   A3   B3   C3 
    6         x    x    x 
    5              x    x    x 
    4                   x    x    x 
    3                        x    x    x 
    2                             x    x    x 
    1                                  x    x    x 
    0                                       x    x    x 


It might be noted that if the questions have more than two categories, 
one can obtain slight deviations from the perfect parallelogram pat- 
tern and still have the scale criterion satisfied. That is, the parallelo- 
gram pattern is nothing more than a visual aid in explaining scale 
theory. In the parallelogram pictured above, the reaction of the en- 
tire population to the entire universe of content can be completely 
specified by giving the proportion of people having rank 6, the pro- 
portion having rank 5, and so on. Notice particularly that knowledge 
of a person’s rank immediately tells his pattern of responses. It is 
extremely easy to write down information and knowledge questions 
which one would expect to scale, and many simple illustrations of this 
type are given throughout Volume IV. 
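
The correspondence between rank and response pattern can be illustrated with a short sketch. The cutting points in `CUTS` are hypothetical values chosen only so that the smallest proportion endorses A1, the next smallest B1, and the next smallest C1; the function names are likewise illustrative and do not come from Volume IV. The check in `is_scale` is a direct transcription of the quoted definition, with category 1 taken as the most favorable.

```python
# Hypothetical cutting points: a person of rank r falls in category 1 of an
# item if r >= first cut, in category 2 if r >= second cut, else in category 3.
CUTS = {"A": (6, 3), "B": (5, 2), "C": (4, 1)}


def pattern_for_rank(rank):
    """Return the response pattern (item -> category 1..3) implied by a rank."""
    pattern = {}
    for item, (cut1, cut2) in CUTS.items():
        if rank >= cut1:
            pattern[item] = 1
        elif rank >= cut2:
            pattern[item] = 2
        else:
            pattern[item] = 3
    return pattern


def is_scale(patterns_by_rank):
    """Check the definition: a higher-ranked person is just as high or higher
    on every item. Category 1 is most favorable, so 'higher' means a smaller
    category number."""
    ranks = sorted(patterns_by_rank)
    for low in ranks:
        for high in ranks:
            if high <= low:
                continue
            for item, low_cat in patterns_by_rank[low].items():
                if patterns_by_rank[high][item] > low_cat:
                    return False
    return True


patterns = {rank: pattern_for_rank(rank) for rank in range(7)}
for rank in sorted(patterns, reverse=True):
    print(rank, " ".join(f"{item}{cat}" for item, cat in patterns[rank].items()))
print("scale:", is_scale(patterns))
```

Under these assumed cutting points the seven ranks yield seven distinct patterns, so that the rank alone is enough to write down a person's answers to all three questions.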

The high degree of consistency required of the response pattern 
demands that the internal ordering of the categories in each question 
be with respect to some common element. Guttman refers to this by 
saying that all the items have a single content meaning. Moreover, 
the rank order of an individual contains all of the information which 
would be available if the responses to each of the questions were kept 
distinct. To put this in statistical terms, rank order is a sufficient 
statistic. These observations arise through a purely formal analysis 
and it would be a mistake at this stage to read more meaning into 
them than actually exists. The scale itself does not define or name con- 
tent any more than a correlation coefficient implies cause and effect. It 
is relative to the population (composition and time of questioning) 
and to the defined universe of content. A broader universe of content 
might or might not give a scale (any subset of the original universe 
will always give a scale). All of these points must eventually be given 
a psychological meaning, perhaps by tying this formal analysis back 
into a social-psychological definition of an attitude. 

One final comment on the scale or parallelogram pattern con- 
cerns the nature of items in the universe of content. It has already 
been noted that each item must contain two or more categories which 
can be naturally ranked from favorable to unfavorable. If a scale 
is to exist, these individual items must be essentially cumulative in 
nature. The prototype of items having an intrinsic cumulative char- 
acter is the social distance scale, with such items as the following 
(example from Chapter 1): 


1. Would you want a relative of yours to marry a Negro? 
(“Yes” or “No”). 

2. Would you invite a Negro to dinner at your home? 
(“Yes” or “No”). 

3. Would you allow a Negro to vote? (“Yes” or “No”). 


It is possible to ask questions which do not have this cumulative char- 
acter, even though a single content meaning is still present. Under 
appropriate circumstances, this can lead to a pattern of responses 
different from the parallelogram pattern, where there still exists a 
one-to-one relationship between rank order and response pattern. This 
problem is discussed in Chapter 1 and undoubtedly deserves further 
investigation. 

(c) Inferences drawn from the scale model. If a population of 
individuals and a universe of attributes produce a perfect scale, there 
are certain useful conclusions which can be drawn, either by a logical 
examination of the scale pattern or by the use of mathematical analy- 
sis. A brief account will now be given of the more important of these. 


Perhaps the most important aspect of the scale pattern is the sim- 
ple fact that a person’s responses to every item in the universe can 
be reproduced from a knowledge of his rank alone. Not only does 
this make the description of the “totality of behavior” a simple one, 
but it also means that a person’s rank order exhausts or summarizes 
the information which all items can give about him. Consequently, 
the conclusion which is stated so frequently in Vol. IV follows immediately: 
“The zero order correlation with the scale score is equivalent 
to the multiple correlation with the universe.” Guttman refers to this 
as the problem of external prediction. The above quotation implies 
that one is interested in a linear combination of the questions as a 
predictor. Rank order might not be the “best” predictor if curvi- 
linear regression were more appropriate for predicting a particular 
external variable. 

The second generally useful result flowing out of the existence 
of a scale is the manner in which it allows one to attack the problem 
of drawing a sample of questions from the universe of content. If a 
perfect scale exists, then one can be sure that any sample of questions 
selected from the universe will scale (i.e., form a cumulative or paral- 
lelogram pattern) and that people will be placed in their proper rank 
order. Individuals having the same rank in the sample of items might 
have different ranks if all items were used—and probably would. How- 
ever, a person having a higher sample rank than a second person 
would, of necessity, have a higher rank on the entire universe. In 
theory at least (and practice will be discussed in the next section), 
the existence of a scale solves the question-sampling problem. One 
does not need to use the entire universe, but only enough questions to 
provide the desired number of scale ranks. 
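
The order-preserving property of item samples can be checked by brute force in a small hypothetical case. The sketch below assumes a universe of six dichotomous items forming a perfect scale, with one person per scale type; the names `responses`, `sample_rank`, and `order_preserved` are illustrative only. It verifies, for every non-empty sample of items, that a higher sample rank always implies a higher rank on the whole universe.

```python
from itertools import combinations

N_ITEMS = 6  # hypothetical universe of six dichotomous items


def responses(rank):
    """Perfect scale: the person of rank r endorses (1) the r 'easiest' items."""
    return [1 if rank > i else 0 for i in range(N_ITEMS)]


def sample_rank(rank, items):
    """Rank of a person computed from a sample of items only."""
    answers = responses(rank)
    return sum(answers[i] for i in items)


def order_preserved(items):
    """True if a higher sample rank always implies a higher universe rank."""
    people = range(N_ITEMS + 1)
    return all(
        p > q
        for p in people
        for q in people
        if sample_rank(p, items) > sample_rank(q, items)
    )


every_sample_ok = all(
    order_preserved(sample)
    for size in range(1, N_ITEMS + 1)
    for sample in combinations(range(N_ITEMS), size)
)
print(every_sample_ok)

# Persons of universe ranks 2 and 3 are tied on the sample made up of the
# first and last items, although they differ on the full universe.
print(sample_rank(2, (0, 5)), sample_rank(3, (0, 5)))
```

The tie in the last line illustrates the remark that individuals with the same sample rank may have different ranks on the universe, while the ordering of untied individuals is never reversed.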


As noted above, an individual’s rank order may be all that is 
needed for external prediction. There is no need to worry about scores 
for people or weights for the categories of the various items. How- 
ever, it is sometimes necessary to inquire a bit more deeply into the 
internal relationships (i.e., within the scale pattern) between persons 
having different ranks, between different categories of the same ques- 
tion, and between different questions. Considerations of this kind in- 
troduce the problem of a metric. In other words, can individuals be 
given a more meaningful quantitative score (on the scale from favor- 
able to unfavorable) than their rank order? 

Guttman has approached this question through the well-known 
principle of least squares. Quoting from Chapter 9: “The most inter- 
nally consistent scores to assign the people on the basis of their re- 
sponses to the items are those that satisfy the following condition. 
All people who fall in one category of an item should have scores as 
similar as possible among themselves, and as different as possible 
from the scores of the people in the other categories of the item; this 
should be true to the best extent possible for all items simultaneously.” 
These scores can be obtained without introducing the notion of cate- 
gory weights, and it is then possible to determine independently the 
most consistent category weights through the use of a criterion simi- 
lar to that set forth above. However, it can be proved that there is a 
relationship between the “best” scores and the “best” weights. Briefly, 
the score of a person is proportional to the arithmetic mean of the 
weights of the categories by which he is characterized, and the weight 
of a category is proportional to the arithmetic mean of the scores of 
the people who are in it. As far as present applications of scale analy- 
sis are concerned, this fact seems to be used only as a justification 
for the simple scoring system which is ordinarily applied. However, 
it is of some theoretical interest to have this “duality” between per- 
sons and questions. 
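
The duality just described suggests a simple numerical illustration: alternate between averaging scores into category weights and averaging weights back into scores until the scores stop changing. This iteration is an assumption of the present sketch rather than a procedure taken from Chapter 9, and the data (four dichotomous items in a perfect scale, one person per scale type) and the starting scores are hypothetical. Scores are centered and rescaled after each round, which removes the trivial solution in which everyone receives the same score.

```python
from math import sqrt

N_ITEMS = 4
PEOPLE = range(N_ITEMS + 1)  # one hypothetical person per scale type
# Perfect scale: the person of rank r endorses (1) the r easiest items.
X = [[1 if r > i else 0 for i in range(N_ITEMS)] for r in PEOPLE]


def category_weights(scores):
    """Weight of each (item, category): mean score of the people in it."""
    weights = {}
    for i in range(N_ITEMS):
        for c in (0, 1):
            members = [scores[p] for p in PEOPLE if X[p][i] == c]
            weights[(i, c)] = sum(members) / len(members)
    return weights


def person_scores(weights):
    """Score of each person: mean weight of the categories he falls in."""
    return [
        sum(weights[(i, X[p][i])] for i in range(N_ITEMS)) / N_ITEMS
        for p in PEOPLE
    ]


def standardize(values):
    """Center the values and rescale them to unit mean square."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    scale = sqrt(sum(v * v for v in centered) / len(centered))
    return [v / scale for v in centered]


scores = standardize([3.0, 1.0, 4.0, 1.0, 5.0])  # arbitrary starting scores
for _ in range(200):
    weights = category_weights(scores)
    scores = standardize(person_scores(weights))

print([round(s, 3) for s in scores])
```

In this small case the scores settle into an order that agrees with rank order, although the starting values did not. The sign of the solution depends on the starting values, so a reversed ordering would be an equally valid outcome.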

In addition to justifying the use of a simple scoring system in 
the applications of scale analysis, the search for a metric also introduced 
the concept of principal components. It was found that there 
was more than one set of internally consistent scores, that is, scores 
from which weights could be derived which would give back the origi- 
nal scores (at least within a factor of proportionality). As a matter 
of fact, if the universe of attributes contains m different types of 
dichotomous items (two items being of the same type if they are per- 
fectly correlated), then there will be m different sets of scores. How- 
ever, these various sets of scores will differ in respect to their degree 
of internal consistency (see criterion above). This means that one 
now has three different ways of characterizing the way in which a 
person reacts to the universe of attributes, namely, (1) by his re- 
sponse pattern, (2) by his rank, or (3) by his score on each of the 
principal components. The third characterization will perhaps prove 
to be the most helpful in attempting to tie scale analysis back into a 
social-psychological definition of attitude. Although a few advances 
in this direction have been made, most of this must wait on future 
research. 

The most internally consistent set of scores has been called the 
“content” scores. It can be shown that an individual having a higher 
rank order than a second will also have a higher content score, and 
vice versa. There seems to be little doubt that this stands up under 
logical examination. The second most internally consistent set of 
scores (the second component) has been shown on a purely formal 
basis to be a U-shaped function when plotted against rank order. Ex- 
perimental work indicates that “intensity of feeling” for attitude ques- 
tions has a U-shaped relationship with rank order. In other words, 
the less favorable a person is on a certain attitude the more intensely 
he holds this position, and similarly for persons on the favorable end 
of the scale. The third component, considered as a function of rank 
order, will have two bends, and so on through all the components 
which exist in a particular universe of content. Each of these com- 
ponents is a perfect (but curvilinear) function of rank order. How- 
ever, they all have zero linear correlations with one another. 
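
The shape of the first two components can be examined numerically in a small hypothetical case. The sketch below assumes four dichotomous items in a perfect scale with one person per scale type, and obtains each component by repeated averaging of scores into category weights and back, removing any earlier component at every step. This deflation device and the starting values are assumptions of the sketch and are not taken from Volume IV; the sign of each component is arbitrary.

```python
from math import sqrt

N_ITEMS = 4
PEOPLE = range(N_ITEMS + 1)  # one hypothetical person per scale type
X = [[1 if r > i else 0 for i in range(N_ITEMS)] for r in PEOPLE]


def average_step(scores):
    """One round of averaging: scores -> category weights -> new scores."""
    new = [0.0 for _ in PEOPLE]
    for i in range(N_ITEMS):
        for c in (0, 1):
            members = [p for p in PEOPLE if X[p][i] == c]
            weight = sum(scores[p] for p in members) / len(members)
            for p in members:
                new[p] += weight / N_ITEMS
    return new


def standardize(values, remove=()):
    """Center, project out earlier components, rescale to unit mean square."""
    mean = sum(values) / len(values)
    v = [x - mean for x in values]
    for u in remove:
        dot = sum(a * b for a, b in zip(v, u)) / len(v)
        v = [a - dot * b for a, b in zip(v, u)]
    scale = sqrt(sum(x * x for x in v) / len(v))
    return [x / scale for x in v]


def component(start, remove=()):
    """Iterate the averaging step from arbitrary starting scores."""
    v = standardize(start, remove)
    for _ in range(200):
        v = standardize(average_step(v), remove)
    return v


first = component([3.0, 1.0, 4.0, 1.0, 5.0])
second = component([9.0, 2.0, 1.0, 5.0, 6.0], remove=[first])

print([round(x, 3) for x in first])   # rises steadily with rank
print([round(x, 3) for x in second])  # high at both ends, low in the middle
```

In this example the first set of scores rises steadily with rank and the second falls to a minimum at the middle rank and rises again, while the sum of products of the two sets is zero to within rounding error.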

All of the above outlined characteristics of the principal compo- 
nents are illustrated from a formal point of view in Chapters 1 and 
9 of Volume IV. In addition, the application of the second or inten- 
sity component to the work of the Research Branch is described. It 
would appear at the present time that there may be some hope in the 
future of isolating, defining, and measuring independently compo- 
nents above the second. As a matter of fact, Guttman, in a personal 
communication, has indicated that he has succeeded in empirical isolation 
of the third component. This has been called “closure.” However, 
there is no guarantee that this process can be carried on indefinitely, 
or even that there is any necessary correspondence between the 
mathematical components and the empirically determined psychologi- 
cal components. There need be no psychological equivalent of the 
quantities derived solely from mathematical analysis. This mathe- 
matical analysis can be taken as a guide for future research, but it 
cannot replace it. Furthermore, there are many fundamental aspects 
of the model itself which need to be investigated (“how to define uni- 
verse of content?” and “how frequently does the model apply?”) be- 
fore attention is devoted solely to the “overtones” of the principal 
components. 


3. The Scale Model in Practice 


(a) Deviations from a perfect scale. Thus far no populations 
and universes of content, at least insofar as the measurement of atti- 
tudes is concerned, have been found which give rise to a perfect scale. 
This is not surprising if one only remembers the highly restrictive 
nature of the scale pattern. Notice that this observation does not de- 
pend upon studying the entire population and the entire universe of 
content. If a perfect scale exists, then any sample of individuals and 
any sample of items must also show the scale pattern. 

The first question that one may ask concerns the reasons for ob- 
taining deviations from a perfect scale. This can only occur because 
there is more than a single dimension operative in the reaction of in- 
dividuals to questions. This may imply that there are several major 
“content meanings” inherent in the universe of content, that indi- 
viduals are not consistent in their reaction to the questions (the prob- 
lem of test-retest reliability), or that there are many minor variables 
causing disturbances. The contents of Volume IV admittedly treat 
these problems from a more or less intuitive point of view. Until a 
body of experience has grown up, this may be all that one can do. How- 
ever, ultimately a more precise explanation must be given. 

At this stage in the development of scale theory, the classification 
of a deviant scale pattern into one of the above categories is based 
mainly on visual inspection. Thus, if the proportion of errors is small 
(and a discussion of “small” will be given shortly), and if they are 
scattered randomly through the various rank orders, an approximate 
scale is said to exist. In other words, it is assumed that many small 
variables—unimportant for descriptive or predictive purposes—are 
causing the disturbances. On the other hand, if the errors are rela- 
tively large and tend to be concentrated in a particular rank group, 
it is inferred that a major variable is distorting the pattern. The final 
type differentiated is the case where the errors occur most frequently 
in the midrank range, less frequently in the extreme rank individuals, 
and are distributed randomly among individuals and categories. This 
is referred to as a quasi-scale. Illustrations of these three types are 
given in Chapter 5. Here, as always, a more precise definition of ran- 
domness is required. However, even with such a definition, these 
three categories will tend to blend into one another and a sharp dis- 
tinction will probably never be possible. 

Concurrently with the types of deviations which may occur, one 
must also be interested in the amount of error which occurs. There 
are several different ways in which this error can be measured, two 
of which can be easily illustrated with the following pattern. 

RESPONSE PATTERNS OF INDIVIDUALS 


Individuals Categories 
A1 B1 C1 A2 B2 C2 A3 B3 C3 
10 x x x 
9 x x x 
8 x x (x) 
7 x x (x) 
6 x x x 
5 x x x 
4 (x) x x 
3 x x x 
2 x x (x) 
1 x x x 


The first and most obvious way to measure error is to compute the 
proportion of non-scale individuals. For the above pattern, this is 
clearly 4/10 or 40 per cent. The difficulty with this measure is that 
it does not take into account the number of items in the universe of 
content. Thus an individual is considered as one deviant whether his 
responses to all questions deviate from the scale pattern or whether 
only one response out of a large number deviates from the scale pat- 
tern. In this instance there are four deviant responses (enclosed in 
parentheses in the above diagram) out of a total of 30 responses—13 
per cent error. If we take one minus the proportion of error meas- 
ured in this way, we obtain Guttman’s coefficient of reproducibility. 
It would be possible to make still further refinements in the measure- 
ment of error if one considered not only the deviant responses but 
also the number of categories by which they deviate from the position 
they would occupy in a perfect scale. However, this would involve 
some idea of “distance” between categories. Also, it is not always 
possible to tell exactly which of the responses are the deviant ones. 
All of this discussion is still in terms of the entire population and the 
entire universe of content. The sampling problem will come in the 
next subsection. 
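
The two measures of error can be restated as a short computation. The sketch below takes as given the number of deviant responses for each of the ten individuals in the pattern above (one each for individuals 8, 7, 4, and 2) and three items; the function names are illustrative only, and the sketch does not attempt the harder task of deciding which responses are the deviant ones.

```python
def proportion_nonscale_individuals(errors_per_person):
    """Share of individuals with at least one deviant response."""
    deviants = sum(1 for e in errors_per_person if e > 0)
    return deviants / len(errors_per_person)


def coefficient_of_reproducibility(errors_per_person, n_items):
    """One minus the proportion of deviant responses among all responses."""
    total_responses = len(errors_per_person) * n_items
    return 1 - sum(errors_per_person) / total_responses


# Deviant responses for individuals 10, 9, ..., 1 in the pattern above.
errors = [0, 0, 1, 1, 0, 0, 1, 0, 1, 0]

print(proportion_nonscale_individuals(errors))               # 4 of 10 individuals
print(round(coefficient_of_reproducibility(errors, 3), 3))   # 1 - 4/30
```

The first measure ignores the number of items, while the second spreads the same four errors over all thirty responses, which is why the two figures differ so widely.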

Since perfect scales are not likely to exist, much less arise in 
practice, how much error can be tolerated before the theoretical ad- 
vantages of scale analysis begin to break down? It is at this point 
that Volume IV leaves one more or less with the impression of being 
suspended in mid-air. The following two guides are given: 

(1) If the population coefficient of reproducibility is in the 
neighborhood of .90 (and the errors are random), then it is implied 
that one will gain all the advantages of having a perfect scale (ranks 
will adequately predict responses, ranks will serve for external prediction, 
and internal analyses for a metric, intensity and other components 
will be valid). 

(2) No matter how low the population coefficient of reproduci- 
bility may be, if the errors are distributed in the random gradient 
pattern of a quasi-scale then the ranks (which will no longer ade- 
quately reproduce responses) will still provide the best means of ex- 
ternal linear prediction. This implies that there is a major single 
content variable, but that it is disturbed by many small random ef- 
fects. 

It is recognized that the critical coefficient of reproducibility 
must be set somewhat arbitrarily at the present time, but there must 
have been some fortunate or unfortunate experiences which led to the 
choice of .90 as the critical value. An account of this previous experi- 
ence might serve as a useful guide in future research. 

(b) Testing the hypothesis of approximate scalability. The fun- 
damental problem in applying scale analysis to any particular situa- 
tion lies in determining whether or not an approximate scale actually 
exists. Once it has been determined that an approximate scale exists, 
the advantages set forth in the preceding section follow. The testing 
procedure depends upon defining the population of individuals and 
drawing a sample from that population, upon delimiting the universe 
of content and selecting a sample of items, upon observing the reac- 
tion of the individuals to the sample of items, upon computing the 
coefficient of reproducibility from the samples, and finally, upon in- 
ferring from the sample results what would have arisen had the en- 
tire population reacted to the entire universe of content. 

The definition and sampling of the population of individuals is 
not particularly different for scale analysis than for any other re- 
search problem since one desires to draw valid inferences from the 
sample concerning the entire population. 

The definition and sampling of the universe of content is a much 
more formidable task than is the sampling of people. Scale analysis 
cannot define content. It can only tell what to do with content after 
a scale has been shown to exist. The materials of Volume IV are not 
particularly helpful in this respect. It is all very well to state that one 
should first give the universe of content a name, should then write 
down all questions which seem to relate to this universe, and should 
then select a sample with which to test scalability. Although this may 
sound like a simple procedure, it is, in reality, a very complex one. 
This is clearly stated in Volume IV, and although some general rules 
are given to ensure that a spuriously high coefficient of reproduci- 
bility will not be obtained in the testing procedure (see succeeding 
paragraphs), the major tasks still remain. Some advances in this 
direction have already been made by Edwards and Kilpatrick (1), who 
combine the Thurstone, Likert, and Guttman approaches to scale 
construction. 

Once the sample of people has been selected and the sample of 
questions has been prepared, one enters the practical phase of asking 
the questions. There has been some doubt raised concerning the feasi- 
bility of maintaining rapport with respondents when many questions 
bearing on the same content are being used. This has been specifi- 
cally raised by Festinger (2), but the real test of this point can come 
only through experience. This reviewer has had no experience on 
this score. Neither has he had an opportunity to review the litera- 
ture and summarize others’ experience. Many successful applications 
are reported in Volume IV, but it can be argued that the polling of 
soldiers in an army atmosphere is not indicative of what will happen 
with other populations. 


After the answers to the questions have been obtained, the sam- 
ple coefficient of reproducibility must be determined. This is not a 
trivial problem since it is necessary to determine, on a more or less 
“cut and try” basis, that ordering of individuals and that ordering of 
questions which will give the highest reproducibility. Two chapters 
in Volume IV, prepared by Suchman, describe in great detail the use 
of a “scalogram” board for carrying out this operation. This is a 
very neat mechanical device for doing a difficult job, but its use is 
handicapped by the fact that only about 100 individuals can be treated 
at one time. A paper and pencil technique, which suffers from this 
same restriction as to size of sample, has been described by Gutt- 
man (4). Recently, Ford (3) has suggested a procedure which makes 
use of IBM punched card equipment and which therefore imposes no 
restrictions on the number of individuals used in the testing. The 
only troublesome feature of these approaches lies in the combination 
of categories of various questions. This is justified on the grounds 
that one cannot tell in advance which categories will actually be dis- 
tinct to the respondents. There does not seem to be any serious theo- 
retical difficulty inherent in this combination of categories, provided 
some of the general rules now to be set forth are observed. 

When the individuals are arranged in rank order, the preceding 
work gives rise to a pattern of responses and to a coefficient of repro- 
ducibility. From these two items of information, one must infer what 
would have happened had the entire population reacted to the entire 
universe of content. This inference, at the present time, rests mainly 
on certain “rules of thumb” which appear adequate until more pre- 
cise rules can be determined. 

The primary concern is that the sample coefficient of reproduci- 
bility is not spuriously near .90—i.e., near .90 solely as a result of 
chance. In order to give an exact answer (in terms of probability) 
to this question, much more needs to be known about the entire pro- 
cess than is known at present, and in particular, sampling theory for 
the reproducibility coefficient is required. However, Vol. IV presents 
some rules to follow which will help out in this respect. These are: 

(1) A sample of at least 100 individuals should be used in testing 
the hypothesis of approximate scalability. 

(2) The more items included in the test, the greater is the as- 
surance that the universe is scalable. A sample of at least ten items 
is recommended for this task. 

(3) The more response categories for items included in a test, 
the greater is the assurance that the entire universe is scalable. 

(4) If all sample items have marginal distributions in the 
neighborhood of 80-20 or 90-10 splits (i.e., for dichotomous items), 
then spuriously high coefficients are likely to be obtained. Therefore, 
it is recommended that the sample should contain items having as 
wide a range of marginal distributions as possible, and specifically, it 
should include items with marginals around 50-50. Moreover, items 
should have at least 90 per cent reproducibility by themselves, or 
should contain “more non-error than error.”

(5) The pattern of errors should be examined to see that there 
are no groupings of non-scale types. This problem of pattern has 
already been treated earlier in this review.

These criteria are discussed and illustrated at some length in 
Vol. IV with both empirical data and hypothetical examples. Lacking 
a theory of sampling for the reproducibility coefficient, they are per- 
haps the best that one can do. Experience and research will tell more 
about this. 
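
Rule (4) can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the function names are hypothetical, the items are generated independently (so that no scale exists), and an "error" is counted as any response that differs from the ideal scale pattern for the same total score, which is not necessarily the counting rule used in Vol. IV.

```python
import random

def reproducibility(responses):
    """Share of responses reproduced from each person's total score.

    `responses` is a list of equal-length lists of 0/1 answers.
    Assumption: an error is any response that differs from the ideal
    scale pattern having the same total score.
    """
    n_items = len(responses[0])
    totals = [sum(row[j] for row in responses) for j in range(n_items)]
    # Items ordered from most to least frequently endorsed.
    order = sorted(range(n_items), key=lambda j: -totals[j])
    errors = 0
    for row in responses:
        score = sum(row)
        for rank, j in enumerate(order):
            predicted = 1 if rank < score else 0
            errors += int(row[j] != predicted)
    return 1 - errors / (len(responses) * n_items)

def random_sample(n_people, p_yes, n_items, seed):
    """Independent dichotomous items, each with P(Yes) = p_yes."""
    rng = random.Random(seed)
    return [[int(rng.random() < p_yes) for _ in range(n_items)]
            for _ in range(n_people)]

even = reproducibility(random_sample(200, 0.5, 4, seed=1))
extreme = reproducibility(random_sample(200, 0.9, 4, seed=1))
print(f"50-50 marginals: {even:.2f}   90-10 marginals: {extreme:.2f}")
```

Under these assumptions the items with 90-10 marginals yield the higher coefficient even though no scale is present, which is the reason for requiring a wide range of marginals.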

(c) Measuring intensity and the zero point. It is shown em- 
pirically in Chapter 7 that intensity of feeling bears the U-shaped 
relationship to rank order that theory predicts for the second com- 
ponent. Consequently, the second component has been given the name 
of “intensity.” The low point of the “U” (or the zero point) is interpreted
as a cutting point on the scale to separate the “favorables”
from the “unfavorables.” Unfortunately, measurement techniques are 
not available for making intensity the perfect function of rank order 
that theory predicts. Much error is inherent in its measurement. 
One cannot doubt the usefulness of a “zero point.” However, 
there are many questions concerning the interpretation of intensity 
as the second component. Is intensity really the second component or 
does it only appear so because of the large amount of error involved 
in its measurement? If intensity actually is the second component in 
some instances, is it always the second component? May not the iden- 
tity of the second component vary from one type of situation to an- 
other? Does the cutting point provided by the second component ac- 
tually divide the population of individuals into “favorable” and “un- 
favorable,” or must some other interpretation be given to this cutting 
point? The answers to these questions would seem to be dependent at 
least partially on finding improved methods of measuring intensity 
and on applying these and the present methods to a wide range of 
practical situations. 

(d) The incidence of scales. The amount of effort which should 
be expended on the unsolved problems of scale analysis must be deter- 
mined somewhat from the extent to which they are applicable in prac- 
tical problems. It is admitted in Volume IV that the occurrence of a 
scale is probably the exception rather than the rule. Many examples 
of scalable universes and populations are given, and many more ex- 
amples are cited where scalability was not found. At this stage in 
development, it would seem very worthwhile to make a survey of at- 
tempted applications and to summarize this information. This re- 
viewer would have liked very much to have included such a survey, 
but time was not available. 


4. The Model for Latent Structure Analysis and its Relationship 
to the Scale Model 


(a) Preliminary remarks. It would appear likely that any “un- 
initiated” person will be left with a feeling of frustration after a first 
reading of Chapters 10 and 11 on latent structure analysis. Such a 
situation will be unfortunate, but also perhaps unavoidable. The dif- 
ficulties stem from three facts. These are: 

1. Scale analysis attempts to pick out a particular pattern of 
responses (of people to questions) and to extract from this pattern 
of responses all of the pertinent information. The basic approach is 
simple, the ideas are simple, and much experience has been developed. 
On the other hand, latent structure analysis attempts to set up a model 
which will be applicable to a much wider range of response patterns 
and the underlying concepts are as a result more complex. 

2. Scale analysis, in theory and in practice, can be presented 
with a minimum amount of mathematical analysis (except where 
principal components are involved) while latent structure analysis, 
once the underlying model has been set up, depends very heavily on 
mathematical formalization and its application requires a large 
amount of numerical computation. 

3. Scale analysis is spread over eight chapters, while latent 
structure theory is concentrated in two. This is a direct result of the 
fact that scale analysis developed and was applied over the entire 
period of World War II, whereas latent structure theory grew up in a 
short space of time and has not as yet had a great deal of practical 
application. 

The above points have been set forth for two reasons. First, it 
is to be hoped that a clear understanding of their implications will 
better enable readers of Volume IV to pick out the salient features 
of latent structure theory without being repelled by the complexity 
of its mathematical and computational aspects. In this respect it 
might be noted that a very well written, general account of latent 
structure theory is presented in Chapter 1 (by Stouffer). It is strong- 
ly recommended that a general reader examine this material very 
carefully before proceeding to Chapters 10 and 11. The second rea- 
son for making these observations is so that one can better realize the 
limitations which are imposed on a review such as the present one. 
The wide ramifications of latent structure theory simply will not ad- 
mit as clear an exposition as did the rather narrower concepts of scale 
analysis. 

(b) The model. Although it is not explicitly stated in Volume 
IV, and much confusion would be avoided if it were, latent structure 
theory starts at the same point as does scale theory. A population of 
individuals and a universe of questions (of common content) are assumed.
Thus far the main body of theory has been developed for 
dichotomous questions. The reaction of each individual to each ques- 
tion is observed and the results are called the “manifest data.” One 
can now state (quoting from Chapter 1) that “The latent structure 
approach is a generalization of Spearman-Thurstone factor analysis. 
The basic postulate is that there exists a set of latent classes, such 
that the manifest relationship between any two or more items on a 
questionnaire can be accounted for by the existence of these latent 
classes and by these alone. This implies that any item has two com- 
ponents—one of which is associated with latent classes and one of 
which is specific to the item. The specific component of any item is 
assumed to be independent of the latent classes and also independent 
of the specific component of any other item.” Actually, at this stage 
it would be better to say not “items on a questionnaire,” but “items se- 
lected from the universe of content.” 

Before proceeding further, one can see at a glance that this pos- 
tulates a much more complex situation than did the Guttman criterion 
that responses should be reproducible from the ranks (i.e., of indi- 
viduals). Moreover, the parallelogram pattern of responses of scale 
analysis satisfies the latent structure postulate, the ranks defining 
the classes of the structure. 

Latent structure theory attempts to explain the observed response 
pattern in terms of an underlying attribute. Or, to put it in different 
terms, it assumes that there is a variable (according to which each 
individual can theoretically be classified into one of a set of mutually 
exclusive classes) which accounts for all the observed relationships 
in the manifest data. These latent classes can be conceptualized in 
many ways. For example, Chapter 10 starts with a continuous variable,
$x$, and derives the latent classes by partitioning the range over
which $x$ varies. In this case, the latent classes have a natural ordering;
but in other cases, such a natural ordering may not be apparent 
from the model or from the computations made with the manifest 
data. 

In the discussion of scale theory it was observed that a scale could 
not define the content. It could only show that there was a single con- 
tent meaning, or a single dimension, existent in the individual ques- 
tion reaction. Just so, the existence of a set of latent classes cannot, 
by itself, provide a name for the latent attribute or for its classes. 
This situation is strictly analogous to that existing in factor analysis. 

The general model hypothesized by latent structure theory is as 
follows. Suppose that there is a latent attribute having $\lambda$ classes and 
that there are $m$ dichotomous questions in the universe of content. 
Let $n_I, n_{II}, \ldots, n_\lambda$ be the number of people falling in the respective 
classes of the attribute. Furthermore, let $p_{ij}$ be the proportion of 
people in the $i$th class ($i = I, II, \ldots, \lambda$) who answer “Yes” to the $j$th 
question ($j = 1, 2, \ldots, m$). In tabular form, this is 

    Latent     Latent Class                  Items
    Class      Frequency         1          2        ...       m

    I          n_I             p_I,1      p_I,2      ...     p_I,m
    II         n_II            p_II,1     p_II,2     ...     p_II,m
    ...        ...             ...        ...        ...     ...
    λ          n_λ             p_λ,1      p_λ,2      ...     p_λ,m

The total number of individuals represented by this table is $n$ (i.e., 
$n_I + n_{II} + \cdots + n_\lambda = n$). The requirement that all relationships 
among the items be accounted for by the latent classes alone can be 
translated into the statement that items should be independent within
classes. This leads to a set of equations of which the following two 
equations are examples: 

\[
n_{12} = n_I \, p_{I,1} \, p_{I,2} + n_{II} \, p_{II,1} \, p_{II,2} + \cdots + n_\lambda \, p_{\lambda,1} \, p_{\lambda,2},
\]
\[
n_{123} = n_I \, p_{I,1} \, p_{I,2} \, p_{I,3} + n_{II} \, p_{II,1} \, p_{II,2} \, p_{II,3} + \cdots + n_\lambda \, p_{\lambda,1} \, p_{\lambda,2} \, p_{\lambda,3}.
\]

Here $n_{12}$ is the number of persons giving “Yes” responses to both items 1 
and 2, and $n_{123}$ is the number of persons giving “Yes” responses to 
all three items 1, 2, and 3. These, and the other equations which must 
hold, state that the responses of people falling in the same class of 
the latent attribute to any set of questions selected from the universe 
of content must show independence between the questions. Lazarsfeld 
puts this intuitively by saying that the latent attribute is the only 
thing which holds the questions together. In practice, one will not 
know that a latent attribute exists; and if it does exist, one will not 
know the latent structure parameters $n_I, n_{II}, \ldots, n_\lambda$ and $p_{I,1}, p_{I,2}, \ldots,$
$p_{I,m}, p_{II,1}, \ldots, p_{\lambda,m}$. These parameters must be determined 
from the manifest responses, once it has been ascertained that the 
latent structure model holds. Further discussion of these points will 
be given under the section on application. 
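
The equations of within-class independence can be sketched numerically. The function name and the parameter values below are hypothetical and serve only to show how a manifest joint frequency follows from a given set of latent parameters.

```python
from math import prod

def expected_joint_yes(class_sizes, p, items):
    """Expected number answering "Yes" to every item in `items`.

    class_sizes[i] is the class frequency n_i; p[i][j] is the proportion
    of class i answering "Yes" to item j. Items are taken to be
    independent within each latent class.
    """
    return sum(n_i * prod(p_i[j] for j in items)
               for n_i, p_i in zip(class_sizes, p))

# Hypothetical two-class structure with three items (values made up).
class_sizes = [60, 40]
p = [[0.9, 0.8, 0.7],
     [0.2, 0.3, 0.1]]

n_12 = expected_joint_yes(class_sizes, p, items=[0, 1])
n_123 = expected_joint_yes(class_sizes, p, items=[0, 1, 2])
print(f"n_12 = {n_12:.2f}, n_123 = {n_123:.2f}")
```

In practice the direction is reversed: the joint frequencies are observed and the latent parameters must be recovered from them.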


(c) Deductions following from the latent structure model. As- 
suming that the latent structure model is satisfied by a population of 
individuals and by a universe of content, one may well ask what use- 
ful deductions follow. Perhaps the foremost problem in this respect 
is that of sampling, for no practical applications can be made without 
induction from a sample. The general aspects of this situation are 
barely mentioned in Volume IV. However, the following comments 
can be made. If the entire population of individuals is used, then no 
matter how a sample of questions is chosen from the universe of con- 
tent, the latent structure model must be satisfied. Furthermore, the 
same type of statement seems to hold true for the sampling of individuals.
Note, however, that the structure parameters $n_I, n_{II}, \ldots, n_\lambda$
(or more appropriately, when a sample is being considered, $n_I/n$,
$n_{II}/n, \ldots$) will depend very markedly on the way the population of 
individuals is sampled. In other words, a “good” sample of individuals
is required. This is not essentially different from the determination
of the proportion of people having various scale ranks as discussed
in the portion of this review devoted to scale theory. 

The structure parameter $p_{ij}$ was defined as the proportion of 
people in the $i$th class who answer “Yes” to the $j$th question. An 
alternative definition might be: $p_{ij}$ is the probability that a person 
of class $i$ will answer “Yes” to the $j$th question. The use of this latter
definition would essentially introduce the idea of test-retest reliability
into the model. Still a third interpretation might be given to 
$p_{ij}$ by assuming that it is the average probability that a person in 
class $i$ will answer “Yes” to the $j$th question. Some thought should 
be given as to the relative merits of these three alternative definitions, 
and to the changes which they might introduce into the latent struc- 
ture theory. 

At the present time, the latent structure model would seem to 
be most useful where it provides a natural means of ranking latent 
classes and response patterns. The simplest case of this is that of a 
latent dichotomy (i.e., each individual either possesses the attribute 
or does not possess the attribute). In this instance, response cate- 
gories can be ranked according to the proportion of individuals in 
that category who possess the attribute. For example, suppose that 
there are three items in the universe which give rise to a latent di- 
chotomy. Then it is possible to compute for each of the eight response 
patterns (+++, ++-, -++, +-+, +--, -+-, --+, ---), 
where a “+” signifies a “Yes” and a “-” signifies a “No,” the proportion
of individuals having that pattern who possess the attribute.
This provides a means of ranking response patterns (except, 
of course, that one cannot tell within a specified response pattern 
which individuals possess the attribute and which do not possess it). 
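
The ranking of response patterns under a latent dichotomy may be sketched as follows. The class frequencies and probabilities are hypothetical, and the function name is introduced only for illustration; the computation assumes independence of items within each of the two classes.

```python
from itertools import product

def pattern_table(n_has, n_lacks, p_has, p_lacks):
    """Expected count and share possessing the attribute, per response pattern."""
    table = {}
    for pattern in product("+-", repeat=len(p_has)):
        w_has, w_lacks = n_has, n_lacks
        for answer, ph, pl in zip(pattern, p_has, p_lacks):
            w_has *= ph if answer == "+" else 1 - ph
            w_lacks *= pl if answer == "+" else 1 - pl
        total = w_has + w_lacks
        table["".join(pattern)] = (total, w_has / total)
    return table

# Hypothetical latent dichotomy with three items (values made up).
table = pattern_table(n_has=60, n_lacks=40,
                      p_has=[0.9, 0.8, 0.7], p_lacks=[0.2, 0.3, 0.1])

# Rank patterns by the share of individuals who possess the attribute.
ranking = sorted(table, key=lambda pat: table[pat][1], reverse=True)
for pat in ranking:
    count, share = table[pat]
    print(f"{pat}  expected count {count:6.2f}  share with attribute {share:.3f}")
```

The ordering obtained depends entirely on the assumed parameters; with other values two patterns could exchange places.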

Certain other latent structures which have this special order fea- 
ture are picked out for special attention in Chapters 10 and 11. For 
example, a Guttman scale has the following structure: 


GUTTMAN SCALE 

    Latent Class                     Items
    Frequency        1     2     3     4     5    ...    m

    n_I              1     1     1     1     1    ...    1
    n_II             0     1     1     1     1    ...    1
    n_III            0     0     1     1     1    ...    1
    ...

A quasi-scale would seem to have the following latent structure: 


QUASI-SCALE 

    Latent Class                     Items
    Frequency         1        2        3       ...       m

    n_I             1-a_1    1-a_2    1-a_3     ...     1-a_m
    n_II             a_1     1-a_2    1-a_3     ...     1-a_m
    n_III            a_1      a_2     1-a_3     ...     1-a_m
    ...
    n_{m+1}          a_1      a_2      a_3      ...      a_m


In this case it is assumed that $a_1, a_2, \ldots, a_m$ are “small” probabilities. 
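
The two structures can be generated by a single rule, with the Guttman scale as the limiting case in which every $a_j$ is zero. This is a sketch only; the function name is hypothetical, and it assumes that $m$ items give rise to $m + 1$ ordered classes.

```python
def quasi_scale(a):
    """Latent probabilities p[class][item] for a quasi-scale with error rates a[j].

    Assumption: m items give m + 1 ordered classes, and class k (counting
    from 0) answers "Yes" to items k, k+1, ..., m-1 apart from error.
    """
    m = len(a)
    return [[a[j] if j < k else 1 - a[j] for j in range(m)]
            for k in range(m + 1)]

guttman = quasi_scale([0, 0, 0, 0])            # perfect scale: all a_j = 0
quasi = quasi_scale([0.05, 0.10, 0.05, 0.10])  # hypothetical "small" a_j

for row in guttman:
    print(row)
```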

The concept of a natural ordering is generalized by Lazarsfeld 
to cases which he defines as possessing latent linearity. However, there 
is neither time nor space available for discussing this in detail. At 
the present time it would appear to be difficult to talk meaningfully 
about any kind of latent structure which does not possess some kind 
of natural ordering. 

Although there are many other aspects of latent structure analy- 
sis which are discussed in Chapters 10 and 11—pertaining to the ex- 
istence of more than one latent set of classes and to the characteristics 
of various types of questions—these cannot be touched upon in this 
review. Perhaps more important than those specific points, however, 
is the general comparison of scale analysis and of latent structure 
theory. The fact that the scale pattern appears as a special case of 
the latent structure theory does not at all detract from the useful- 
ness of scale analysis. Provided that the hypothesis of scalability is 
satisfied, there appears to be more intuitive meaning which can be 
attached to the relationship between population and universe of con- 
tent than will ever be possible in the more complex models. Moreover, 
the internal analyses possible for the scale pattern (following from 
the theory of principal components), have not as yet been generalized 
to the other types of response patterns. Finally, as will be apparent 
in the next section, the practical application of scale theory is much 
simpler than that of latent structure theory. It is naturally under- 
stood that both of these models may have to be changed considerably 
as experience develops. It is difficult to see how a model can remain 
static in a field as complex as that of attitude measurement. 


5. The Practical Application of Latent Structure Theory 


(a) The choice and determination of a particular latent struc- 
ture. Even if one is given a population of individuals and a universe 
of questions which are known to satisfy the latent structure model, 
the problem still remains of actually spelling out the latent structure. 
In particular, it is vitally necessary to specify the number of classes 
which the latent attribute possesses. Provided that there exists such 
a perfect latent structure, Lazarsfeld has devised a criterion to tell 
how many of these classes there must be. Unfortunately, this cri- 
terion seems to be of theoretical interest only and of little practical 
importance. The practical difficulties are attributable to two principal
sources. In the first place, perfect latent structures are no more 
likely to exist than are perfect scales; and in the second place, for 
any sizeable universe of questions, the computations would be well- 
nigh hopeless. 

The foregoing points mean that a latent structure must be fitted 
to a population and universe of questions on a more or less “cut and 
try” basis. In other words, a value of $\lambda$ (number of latent classes) is 
assumed and one proceeds to determine a “best fitting” structure for 
the manifest data. That is, “best fitting” values of $n_I, n_{II}, \ldots, n_\lambda$, 
$p_{I,1}, \ldots, p_{I,m}, p_{II,1}, \ldots, p_{\lambda,m}$ are computed. From these latent structure 
parameters it is possible to predict the number of people who should 
fall in the various response patterns determined by the questions. 
Comparison between the observed and fitted frequencies then gives a 
measure of fit. This, of course, immediately raises the problem of 
what constitutes a reasonable degree of fit. Although the problem is 
recognized and discussed, Volume IV offers very little guidance in 
this respect, and serious thought needs to be given to this phase of 
latent structure analysis. In particular, it seems that more precise 
statements concerning the uses which are to be made of the latent 
structure must be given before questions relating to “reasonableness 
of fit” can be answered. In the event that the classes of the latent 
attribute possess a natural ordering, the answer seems to be that a 
hypothesis of unidimensionality is being tested. If such natural or- 
dering does not exist, the question still remains open. 
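
One way of comparing observed and fitted frequencies is sketched here. The Pearson chi-square statistic is an assumption made for illustration, since Volume IV does not prescribe a particular measure, and the parameter values and observed counts are made up.

```python
from itertools import product

def fitted_frequencies(class_sizes, p):
    """Expected count for every Yes/No pattern under within-class independence."""
    m = len(p[0])
    fitted = {}
    for pattern in product((1, 0), repeat=m):
        total = 0.0
        for n_i, p_i in zip(class_sizes, p):
            weight = n_i
            for answer, p_ij in zip(pattern, p_i):
                weight *= p_ij if answer else 1 - p_ij
            total += weight
        fitted[pattern] = total
    return fitted

def pearson_chi_square(observed, fitted):
    """Sum over response patterns of (observed - fitted)^2 / fitted."""
    return sum((observed.get(pat, 0) - f) ** 2 / f for pat, f in fitted.items())

# Hypothetical two-class structure on two items, and made-up observed counts.
fitted = fitted_frequencies([60, 40], [[0.9, 0.8], [0.2, 0.3]])
observed = {(1, 1): 47, (1, 0): 16, (0, 1): 13, (0, 0): 24}
chi2 = pearson_chi_square(observed, fitted)
print(f"chi-square = {chi2:.4f}")
```

Such a statistic measures closeness of fit only; what value is “reasonable” still depends on the use to be made of the structure.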

It should be noted that a measure of fit derived in the above man- 
ner does not correspond to the coefficient of reproducibility used 
in scale analysis. Such a measure of fit is based upon a consideration 
of individuals whereas the coefficient of reproducibility is based upon 
a consideration of responses. In a crude sort of way, the corresponding
value for scale analysis would be determined from the number of 
non-perfect scale individuals. 

In addition to the above mentioned considerations that are in- 
volved in arriving at a “best fitting” structure, there are two other 
points which can cause no end of difficulty. A tentative value of $\lambda$ 
must be obtained and the structure parameters must be computed. At 
the present time, the determination of $\lambda$ would seem to rest upon more 
or less intuitive and empirical grounds. Limited experience with War 
Department data has led to the development of certain rather general
observations concerning the relationship between question form 
and question content and reasonable values of $\lambda$. Furthermore, certain
structures are not computationally feasible. This means that 
attention must usually be focussed upon a restricted set of latent 
structures (e.g., the latent dichotomy, or the quasi-scale type of struc- 
ture). Fortunately, those structures for which computations are pos- 
sible are most likely to be those for which a natural ordering of the 
latent classes is provided. This may have a salutary effect in that 
effort will not be wasted, for a time at least, in deriving latent struc- 
tures with which no particular meaning can be associated. 


(b) Testing the hypothesis of an approximate latent structure. 
In order to apply latent structure theory to a specific situation, it is 
necessary to go through about the same steps as for the application 
of scale analysis. A population of individuals and a universe of ques- 
tions are defined; a sample is selected from each and the responses 
(i.e., the manifest data) are obtained; a specific latent structure is 
postulated from past experience, knowledge of question form and con- 
tent, and from consideration of what is computationally feasible; the 
fit of this structure to the data is evaluated; the results of the fitting 
procedure are used to infer something about the corresponding relationship
in the entire population and universe; and finally, if the fit 
is too “bad,” either a more complex structure is fitted or else the hypothesis
of an approximate latent structure is rejected. 

Most of these steps require no further discussion (e.g., the sam- 
pling of individuals), or the comments made in the preceding subsec- 
tion will hold (e.g., assumption of a structure for which the compu- 
tations can be carried out, necessity for having some form of natural 
ordering in order to provide meaning for the fitted structure, and so 
on). However, there is one extremely important point which does not 
seem to have received adequate attention, at least insofar as the ma- 
terial in Volume IV is concerned. This is the problem of question 
sampling. For example, suppose that a latent structure with three 
classes is fitted to the responses of the entire population on, say, four 
questions (e.g., assume that $n_I, n_{II}, n_{III}, p_{I,1}, \ldots, p_{I,4}, p_{II,1}, \ldots, p_{II,4}$, 
$p_{III,1}, \ldots, p_{III,4}$ are determined) and that the fit is “reasonable.” If 
several new questions are added, say questions 5 and 6, and the same 
structure is fitted to all six questions (the four old ones and the two 
new ones), how will the new structure parameters compare with the 
old ones? 

There is no reason to expect that the answers to questions of 
this type will be simple. As a matter of fact, the answers must be, 
by the very nature of the problem, extremely complex. However, it 
would seem that some progress might be made on a purely logical 
basis, to say nothing of the empirical studies that could be carried 
out. The need for this material is well illustrated by the following 
quotation taken from Chapter 1: “It should be said that certain of 
the problems of indeterminancy which haunt quantitative factor 
analysis appear in latent attribute analysis. Work with this new tool 
is still too young to have developed a completely standardized set of 
criteria for determining when the approximations are ‘good enough’ 
approximations. Such standards, doubtless, will be forthcoming but 
they wait upon a large amount of further empirical and theoretical 
investigation. For example, two items which seem by a priori in- 
spection of content to be in the same general attitude area as the four 
items in our illustrative example were substituted for items 2 and 4. 
Although by conventional item analysis techniques, these new items 
should belong in the scale, they did not satisfy the criteria for the 
latent dichotomy model.” The example referred to involved two latent 
classes and four dichotomous questions. 

This discussion is not meant to imply that these points are not 
recognized in Chapters 10 and 11. They are recognized. However, 
one is left with the general impression that too much emphasis has 
been placed on the purely formal aspects of fitting a latent structure. 
This is not at all surprising and comes as a direct result of the re- 
cency of latent structure theory, the complexity and generality of the 
model, and the heavy analytic problems involved. More thought and 
research need to be devoted in the near future to the problems of 
question-sampling and to the real meaning of the simpler forms of 
the model. 


6. Summary 
The principal points brought out in the course of this review 
are: 
(1) Scale analysis and latent structure analysis provide models 
and objective procedures for approaching the problem of attitude 
measurement. However, neither of them defines an attitude. Both 
deserve careful attention from a logical and from an empirical point 
of view. Such an examination may bring about changes in the models, 
but this is the only way in which progress can be made. 

(2) Although the scale model can be viewed as a special case 
of the latent structure model, there would seem to be more intuitive 
meaning associated with a scale than with the more general case. 
Moreover, certain internal analyses can be made within the scale 
model which have not as yet been carried over to the other. The prac- 
tical application of scale analysis is less troublesome than that of 
latent structure theory, but it may be applicable in a much narrower 
range of cases. 

(3) Perhaps the single most important problem, and this is 
common to both models, is that of defining the universe of content (or 
of questions), sampling from this universe, and inferring from these 
results something about the entire universe of content. 

(4) As far as the contents of Volume IV are concerned, latent 
structure theory suffers in contrast with scale analysis. In particu- 
lar, less space is devoted to latent structure theory, a body of experi- 
ence has not yet grown up from its application, considerable mathe- 
matical analysis and numerical computation are needed, and not 
enough thought has yet been given to the problems arising out of the 
latent structure model. It should be recognized that these facts are 
due substantially to the recency of the latent structure model, and 
they should not lead research workers to give latent structure analy- 
sis less attention than it deserves. 

An extension of Dwyer’s “square root” method has been made 
to the problem of selecting a minimum set of variables in a multiple 
regression problem. The square root method of selection differs from 
the Wherry-Doolittle method primarily in that (1) the computations 
required are more compact, (2) an F ratio criterion is used which 
leads to the selection of fewer variables. The method provides so- 
lutions for the problems of test selection, item analysis, analysis 
of variance with disproportionate frequencies, and other problems 
requiring the rejection of superfluous variables. In a subsequent 
article a worked example will be given, and the square root and 
Wherry-Doolittle methods compared. 


I. Introduction. 


The purpose of the present paper is to introduce a compact pro- 
cedure for selecting the minimum number of effective independent 
variables in a multiple regression problem. The problem has been 
previously discussed by Wherry (24), who developed a technique 
based on the Doolittle method. The procedure presented here is based 
on what Dwyer (4) has called the square root method of computing 
multiple correlations and regressions. 

R. L. Thorndike (19, pp. 201-202), in addition to giving an ad- 
mirable summary of the problem of selecting a battery of tests, has 
given a description of Wherry’s method which is followed below. With 
some modifications the description applies to the square root selection 
method proposed here. Wherry’s technique permits the successive 
addition of variables. Starting with the most valid single test, the 
second variable which will add most to the validity of the first is 
found. This involves determining the partial correlations of each 
test with the criterion when the first test is held constant. The regres- 
sion weights and multiple correlation for the two tests are found, and 
a third test sought. The process is continued step by step; at each 
step the test that will give the greatest increase in the multiple correlation
is added. 
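
The step-by-step addition of tests described above can be sketched as follows. This is an illustration only: it solves the normal equations directly rather than following the Doolittle tabular layout, it omits any stopping rule, and the function names and correlations are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def multiple_r2(R, validity, chosen):
    """Squared multiple correlation of the criterion with the chosen tests."""
    sub = [[R[i][j] for j in chosen] for i in chosen]
    r = [validity[i] for i in chosen]
    betas = solve(sub, r)
    return sum(b * v for b, v in zip(betas, r))

def forward_order(R, validity):
    """Add, at each step, the test giving the greatest increase in R^2."""
    chosen, history = [], []
    remaining = list(range(len(validity)))
    while remaining:
        best = max(remaining,
                   key=lambda t: multiple_r2(R, validity, chosen + [t]))
        chosen.append(best)
        remaining.remove(best)
        history.append((best, multiple_r2(R, validity, chosen)))
    return history

# Hypothetical intercorrelations of three tests and their validities.
R = [[1.0, 0.6, 0.2],
     [0.6, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
validity = [0.5, 0.45, 0.4]
history = forward_order(R, validity)
for test, r2 in history:
    print(f"add test {test}: multiple R^2 = {r2:.4f}")
```

In this made-up example the second most valid test is not the second test chosen, because it overlaps heavily with the first.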

Although the square root method follows this description on the 
whole, it differs from the Wherry-Doolittle method in several respects 
which we consider crucial. 

(1) The multiple $R^2$ and beta weights for the whole battery of 
tests are computed as a first step. Whether any test selection at all 
is justified can be determined by testing this multiple $R^2$ for significance.
The beta weights serve as a partial check on the final test 
selection. In general, the selected tests can be expected to include all 
tests having significant beta weights, but other tests may also be selected.

(2) The decision procedure for ending the selection of additional
tests depends on the use of the well known F ratio criterion 
(16, p. 266 and 23, p. 262). F ratio tests are made to determine 
whether (a) the increase in the multiple $R^2$ introduced by an additional
variable is greater than chance and (b) the multiple $R^2$ for the 
selected variables is significantly lower than the value obtained using 
all the tests in the battery. Wherry’s decision procedure is shown to 
be too weak; it leads to the selection of more tests than are necessary.* 

(3) The computational procedure is based on Dwyer’s square 
root method rather than the Doolittle method. The computations are 
more compact, the coefficients easier to interpret, and the tabulations 
fewer, than for Wherry’s method. The authors agree with Dwyer 
that the square root technique is an aid “toward fewer errors, less 
computing time and more pleasure for the computor” (4, p. 493). 
The method is particularly advantageous if automatic desk calcula- 
tors are used. 
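
The F ratio referred to under (2) may be sketched in its usual textbook form for comparing two nested sets of predictors. The function name is hypothetical, and it is an assumption that this standard form is the one intended; it serves for both test (a) and test (b).

```python
def f_ratio_for_increase(r2_larger, r2_smaller, n_obs, k_larger, k_smaller):
    """F ratio for the gain in R^2 when k_smaller predictors grow to k_larger.

    Degrees of freedom are (k_larger - k_smaller, n_obs - k_larger - 1).
    """
    df1 = k_larger - k_smaller
    df2 = n_obs - k_larger - 1
    return ((r2_larger - r2_smaller) / df1) / ((1 - r2_larger) / df2)

# (a) Does a second test add to the first? (hypothetical values)
f_a = f_ratio_for_increase(0.50, 0.40, n_obs=103, k_larger=2, k_smaller=1)

# (b) Does the full battery of 6 tests improve on 2 selected tests?
f_b = f_ratio_for_increase(0.52, 0.50, n_obs=103, k_larger=6, k_smaller=2)
print(f"F for (a) = {f_a:.2f}, F for (b) = {f_b:.2f}")
```

Each value would then be referred to the F distribution with the stated degrees of freedom.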


II. Relation of the square root method to factor analysis, partial and 
multiple correlation. 


The square root method is essentially similar to other regression 
methods in that it involves the reduction of the correlation matrix, R, 
of the n independent variables, to an n × n triangular matrix, T, a 
matrix with all its elements above the principal diagonal equal to zero. 
The distinctive feature of the method is the direct calculation of a 
triangular matrix which has the property 

\[ T T' = R. \tag{1} \]

*Professor Burt, in his laboratory notes (1942), has suggested a method of 
test selection based on what he terms the “method of hierarchical subtraction,” 
rather than on the “square root method.” Use of the F ratio for testing the
significance of the increase (2a, above) is mentioned. An account of this technique 
has been published recently. (Burt, Cyril. The numerical solution of linear equations.
Brit. J. Psychol., statist. Sect., 1951, 4, 31-54). 

A. SUMMERFIELD AND A. LUBIN 



Factor analysts will at once recognize expression (1) as a particular 
case of the fundamental theorem of factor analysis embodied in the 
more general expression, e.g., Thurstone (21, p. 66 and 22, p. 78), 


\[ F F' = R \tag{2} \]


where F is, in general, an n × r matrix of factor saturations, r being
the rank of R. 

Thurstone (29, and 22, p. 101) has given a procedure, which 
he called the “diagonal method of factoring,” for deriving a triangu- 
lar (or trapezium) matrix from the correlation matrix. It is identi- 
cal with Dwyer’s procedure. Thurstone, of course, derives his tri- 
angular matrix from a correlation matrix having communalities in 
the principal diagonal, as he wishes to treat it as a matrix of com- 
mon factors. The use of such a procedure in factor analysis was also 
suggested by Burt (1, footnote to p. 307). 

The matrix T is therefore a triangular factor matrix obtained 
by the square root method from a correlation matrix with unities in 
the principal diagonal. We shall refer to it as the square root matrix 
in the context of this article. To call it a “diagonal” matrix would 
be misleading, as in matrix algebra this would mean a square matrix 
where “all the elements other than those in the principal diagonal are 
zero.” (6, p. 3) And to use the term “triangular” matrix would not 
differentiate it from other triangular matrices derived from R that
would not satisfy equation (1). The term “square root” was first used
to describe this triangular factor matrix by Dwyer (4). Holzinger 
and Harman (11, p. 94) in discussing the same procedure, state that 
‘This solution is obtainable by means of a general algebraic proce- 
dure for factoring any symmetric matrix, known as “completing the 
square.” The method was applied specifically to a correlation matrix 
by McMahon [(15)] before 1923.’ Holzinger (10) refers to it as the 
“solid staircase” method. 

In interpreting a square root matrix, e.g., Table 1, as a factor 
matrix, the columns represent orthogonal factors, and the rows rep- 
resent the projections of each test vector on the orthogonal axes. These 
projections are the correlations of each test with the orthogonal fac- 
tors and therefore represent the regression coefficients for obtaining 
the test scores from the factor scores. Since each test vector has a 
length of unity, the factor saturations are actual correlations and the 
test scores can be obtained exactly from the factor scores. 

The essential aspects of the square root method can be simply
described. A first axis, factor, or reference vector is identified with
one of the variables. A second variable is taken and a second factor, 
orthogonal to the first, is derived to account for the residual variance 
of the second variable after the extraction of its communality in fac- 
tor I. It follows that the first variable must have a zero loading on 
factor II. A third factor is made to account for the residual variance 
of a third variable, after the removal of its communality in factors 
I and II. Both first and second variables must now have zero corre- 
lations with factor III. A fourth variable is next taken and treated 
similarly, and so on. 
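The construction just described can be sketched in a few lines of code. What follows is a minimal illustration in Python, not the authors' worksheet layout: the function name `square_root_matrix` and the three-test correlation matrix are hypothetical, chosen only to show that the resulting triangular matrix satisfies property (1).

```python
import math

def square_root_matrix(R):
    """Return lower-triangular T with T T' = R (R symmetric, positive definite)."""
    n = len(R)
    T = [[0.0] * n for _ in range(n)]
    for j in range(n):  # one orthogonal factor is identified with each variable
        # Residual variance of variable j after removing factors 0..j-1.
        resid = R[j][j] - sum(T[j][k] ** 2 for k in range(j))
        T[j][j] = math.sqrt(resid)
        for i in range(j + 1, n):  # loadings of the later variables on factor j
            cross = sum(T[i][k] * T[j][k] for k in range(j))
            T[i][j] = (R[i][j] - cross) / T[j][j]
    return T

# Hypothetical intercorrelations of three tests (illustrative values only).
R = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.4],
     [0.3, 0.4, 1.0]]
T = square_root_matrix(R)

# Rebuild R from T T' to confirm property (1).
R_back = [[sum(T[i][k] * T[j][k] for k in range(3)) for j in range(3)]
          for i in range(3)]
```

The first variable has zero loadings on factors II and III, and the second has a zero loading on factor III, which is the triangular pattern described in the text.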

It is very useful to think of the coefficients of the square root
matrix as correlation coefficients. The first column in Table 1 repre-
sents the correlations between each of the tests and the first factor,
which is simply the first test. The general element of this column is
$r_{i1}$, where i is any test.

The second column in Table 1 gives the correlations of each test
with the 2nd factor, $X_{2.1}$. $X_{2.1}$ is the score on the second test when
the linear effect of the first test is held constant. It is defined by the
following equation:


$$X_{2.1} = X_2 - (a_{21} + b_{21} X_1) \tag{3}$$


where $a_{21} + b_{21} X_1$ is the least-squares prediction of $X_2$ based on $X_1$.

The general element of the second column may be called $r_{i(2.1)}$,
the correlation of the ith test with $X_{2.1}$, that part of the $X_2$ score
which can not be predicted by $X_1$. Professor Burt has suggested that
these coefficients are equivalent to what are known as semi-partial
correlations, and we shall refer to them as such; the reader should
note that they are not the usual partial correlations.

The general element of the third column of Table 1 is $r_{i(3.1,2)}$, the
correlation of the ith test with $X_{3.1,2}$, that part of the $X_3$ score that
cannot be predicted from $X_1$ and $X_2$.

Obviously $r_{1(2.1)}$ must be zero since $X_1$ can only correlate zero
with a score from which all linear influence of $X_1$ has been subtracted.
Similarly,


$$r_{1(3.1,2)} = r_{2(3.1,2)} = 0. \tag{4}$$


The nth column of Table 1 consists entirely of zeros except for
$r_{n(n.1,2,3,\dots,n-1)}$ and $r_{c(n.1,2,3,\dots,n-1)}$.


The last row of Table 1 is the set of correlations of the criterion
test, c, with the n orthogonal factor scores. The first n columns rep-
resent a set of orthogonal scores which are capable of reproducing
the test intercorrelations exactly. The squared multiple correlation of
the criterion, c, with the n orthogonal factor scores, $R^2_{c(n)}$, is equal
to the sum of the n squared correlations in row c. Now, the n test
scores can be reproduced exactly by the n factor scores. Therefore,
the multiple $R^2$ of test c with the n test scores is exactly equal to
$R^2_{c(n)}$, i.e., to the squared multiple correlation of the criterion with
the n orthogonal factor scores. Moreover, since the first i tests are
completely dependent on the first i factor scores, the squared multiple
correlation of the criterion with the first i tests, $R^2_{c(i)}$, is the sum of
squares of the first i correlations in row c. Evidently, $r^2_{c(i+1.1,2,3,\dots,i)}$ is
the additional contribution to this squared multiple correlation made
by the (i + 1)th test.
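A small numerical illustration may make this property of row c concrete. The sketch below is in Python and rests on assumptions: the correlations are hypothetical, the criterion is placed in the last row and column of the augmented matrix, and the helper name `square_root_matrix` is ours. The result is compared with the familiar closed formula for the squared multiple correlation on two predictors.

```python
import math

def square_root_matrix(R):
    """Lower-triangular T with T T' = R."""
    n = len(R)
    T = [[0.0] * n for _ in range(n)]
    for j in range(n):
        T[j][j] = math.sqrt(R[j][j] - sum(T[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            cross = sum(T[i][k] * T[j][k] for k in range(j))
            T[i][j] = (R[i][j] - cross) / T[j][j]
    return T

# Hypothetical correlations: tests 1 and 2, with criterion c last.
R_aug = [[1.0, 0.5, 0.6],
         [0.5, 1.0, 0.4],
         [0.6, 0.4, 1.0]]
T = square_root_matrix(R_aug)

row_c = T[-1][:-1]                       # r_c1 and the semi-partial r_c(2.1)
r2_after_1 = row_c[0] ** 2               # squared multiple R with test 1 alone
r2_after_2 = r2_after_1 + row_c[1] ** 2  # increment contributed by test 2

# Closed formula for two predictors, used only as a cross-check.
r1c, r2c, r12 = 0.6, 0.4, 0.5
r2_direct = (r1c ** 2 + r2c ** 2 - 2 * r1c * r2c * r12) / (1 - r12 ** 2)
```

Both routes give the same squared multiple correlation, with the second entry of row c supplying exactly the increment due to the second test.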

In general, the basic theorem used throughout this article is 
“Given the intercorrelations of n variables, the squared multiple cor-
relation of the ith variable with k other tests, is equal to the com-
munality of the ith variable on the k factors which reproduce the k
other tests.” This theorem holds because each test vector is taken to 
be of unit length, with no specifics. 

General theorems of this type connecting multiple correlation 
with factor analysis, have been given by Roff (18), Dwyer (2, 3), 
and Guttman (8). Computing formulas, for both partial and multi- 
ple correlations have been given by Dwyer (3). 


The equations for calculating the beta weights 

Dwyer (4, pp. 497-499) points out that multiple regression 
weights for each of the variables can be calculated directly from a 
square root matrix such as Table 1. In essence, a “back solution” is 
made for the beta weights similar to the Doolittle procedure. We 
give a matrix algebra proof and then outline the technique in some-
what more detail than is given by Dwyer.

Let R be the n X n correlation matrix of the n independent vari-
ables. $\beta$ is the n X 1 column vector of beta weights whose general
element of the jth row is $\beta_j$. Let $r_c$ be the n X 1 column vector of
correlations of the criterion with the n independent variables. The
element in the jth row of $r_c$ will be denoted by $r_{jc}$.

The usual “normal” equations can be written:


$$R\beta = r_c. \tag{5}$$

From equation (1), TT' = R, so if we substitute for R we get

$$TT'\beta = r_c. \tag{6}$$


If equation (6) is premultiplied by $T^{-1}$, then

$$T'\beta = T^{-1} r_c. \tag{7}$$

The column vector $T^{-1} r_c$ represents the correlations of the cri-
terion with the n orthogonal factors of T. This same set of correla-
tions is shown as the row of correlations for the criterion variable in
Table 1.

The column vectors of T in Table 1 are denoted by t, and $t_j$ will
be the jth column of T. The element of the ith row and jth column in
T will be denoted by $t_{ij}$. Equation (7) implies that each of the col-
umn vectors of T when multiplied by the beta weights, will yield the
semi-partial correlation of the criterion with the orthogonal factor
represented by that column vector. In summation notation


$$\beta' t_j = \sum_{i=1}^{n} \beta_i t_{ij} = r_{c(j.1,2,\dots,j-1)}. \tag{8}$$


Working backwards, we first solve for $\beta_n$, the last beta weight.
We use $t_n$ of Table 1, which has zeros everywhere except in the nth
row and in the row for the criterion. Equation (8) when applied to $t_n$
gives an equation with one unknown,


$$\beta_n r_{n(n.1,2,\dots,n-1)} = r_{c(n.1,2,\dots,n-1)}. \tag{9}$$

In terms of the elements of the $t_n$ vector,

$$\beta_n t_{nn} = t_{cn}. \tag{10}$$


The solution for $\beta_n$ is simply


$$\beta_n = t_{cn}/t_{nn} = \frac{r_{c(n.1,2,\dots,n-1)}}{r_{n(n.1,2,\dots,n-1)}}. \tag{11}$$


Having obtained $\beta_n$, the next weight, $\beta_{n-1}$, can now be found by
using equation (8) with $t_{n-1}$. There are only three non-zero entries
in this column. Equation (8) is


$$\beta_n t_{n,n-1} + \beta_{n-1} r_{(n-1)(n-1.1,2,\dots,n-2)} = r_{c(n-1.1,2,\dots,n-2)}. \tag{12}$$


Since $\beta_n$ is known numerically from equation (11) there is only one
unknown, $\beta_{n-1}$, and we can solve for it. In exactly the same fashion,
all of the other multiple regression weights can be found.


It is recommended that the computation


$$R^2_{c.1,2,\dots,n} = \sum_{i=1}^{n} \beta_i r_{ic} \tag{13}$$


should always be carried out. It is a useful, though not infallible check
on the computation of the beta weights.

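The back solution of equations (8) to (12), together with check (13), can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the function names are ours, the correlations are hypothetical, and the criterion occupies the last row and column of the augmented correlation matrix.

```python
import math

def square_root_matrix(R):
    """Lower-triangular T with T T' = R."""
    n = len(R)
    T = [[0.0] * n for _ in range(n)]
    for j in range(n):
        T[j][j] = math.sqrt(R[j][j] - sum(T[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            cross = sum(T[i][k] * T[j][k] for k in range(j))
            T[i][j] = (R[i][j] - cross) / T[j][j]
    return T

def beta_weights(T, n):
    """Back solution of T' beta = t_c; row n of T is the criterion row."""
    t_c = T[n][:n]
    beta = [0.0] * n
    for j in range(n - 1, -1, -1):  # work backwards from the last weight
        known = sum(beta[i] * T[i][j] for i in range(j + 1, n))
        beta[j] = (t_c[j] - known) / T[j][j]
    return beta

# Hypothetical correlations: tests 1 and 2, with criterion c last.
R_aug = [[1.0, 0.5, 0.6],
         [0.5, 1.0, 0.4],
         [0.6, 0.4, 1.0]]
n = 2
T = square_root_matrix(R_aug)
beta = beta_weights(T, n)

# Check (13): sum of beta_i * r_ic against the sum of squares in row c.
r2_from_betas = sum(beta[i] * R_aug[i][n] for i in range(n))
r2_from_row_c = sum(T[n][j] ** 2 for j in range(n))
```

Agreement of the two values of the squared multiple correlation is the check recommended in the text; as noted there, agreement does not by itself guarantee that every weight is correct.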

III. The F ratio criterion for determining when additional variables
add significantly to the multiple correlation. 


Various methods of systematic test selection have been published:
Dwyer (3), Horst and Smith (12), Johnson (13), Wherry (24); in
none of these cases has an F ratio criterion been used to decide when 
to terminate the selection of variables.* We propose to employ the fol- 
lowing well known test of significance. 

Let $R^2_n$ be the squared multiple correlation based on n variables.
Let $R^2_i$ be the squared multiple correlation based on a sub-set, i, of
the n variables, (i < n). To test the significance of the difference be-
tween $R^2_n$ and $R^2_i$ we take


$$F = \frac{(R^2_n - R^2_i)/(n - i)}{(1 - R^2_n)/(N - n - 1)}, \tag{14}$$


with degrees of freedom, $n_1 = n - i$ and $n_2 = N - n - 1$.

The foregoing paragraph is paraphrased from McNemar (16, p. 
266). A general proof that this F ratio is a likelihood ratio criterion 
is given by H. B. Mann (17, Ch. IV). 

In the case of the square root method, when n = i + 1, i.e., we
wish to test the contribution of the (i + 1)th variable to $R^2_i$, then
($R^2_{i+1} - R^2_i$) is equal to the squared correlation of the criterion with
the (i + 1)th factor. (See Table 1.)

The F ratio is used here primarily as a criterion in a decision 
procedure, not as a test of significance giving exact probability levels. 
The difficulties of interpreting significance levels when F ratio tests 
are made sequentially are discussed in section V. 
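As a worked instance of equation (14), the following Python sketch computes the F ratio and its degrees of freedom. The function name `f_ratio` and the numerical values are hypothetical; N is taken to be the number of cases.

```python
def f_ratio(r2_full, r2_sub, n, i, N):
    """Equation (14): F for the gain in R^2 from i to n predictors, N cases."""
    df1 = n - i
    df2 = N - n - 1
    f = ((r2_full - r2_sub) / df1) / ((1.0 - r2_full) / df2)
    return f, df1, df2

# Hypothetical case: a second variable raises R^2 from .36 to .40 with N = 103.
f, df1, df2 = f_ratio(r2_full=0.40, r2_sub=0.36, n=2, i=1, N=103)
```

Here F is about 6.67 on 1 and 100 degrees of freedom, which exceeds the tabled 5% point of roughly 3.94, so under this criterion the second variable would be retained.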


IV. The square root method of selecting effective independent vari- 
ables. 


The method of selecting variables that we are proposing follows
at once from the properties the square root matrix has been shown to
possess. The multiple $R^2$ of the criterion variable with the complete
battery is determined. The hypothesis that this coefficient does not
differ significantly from zero is tested to ensure that the battery pos-
sesses predictive power. The independent variable having the highest
validity is then selected, $X_{i^*}$, say. Its column of correlations in
R is made the first column, $t_1$, of T. The multiple $R^2$ thus far is


$$R^2_{c.i^*} = r^2_{ci^*}. \tag{15}$$


As before, $X_c$ is the criterion variable.

*Professor Burt’s method includes the use of this procedure. (See P. E. Ver-
non’s Notes on Statistical Methods, 1945, pp. 6-11, where it is described in full
with an illustrative example.)


The semi-partial correlations of all (n - 1) remaining variables
with $X_c$ are now calculated. The set of quantities, $r_{c(j.i^*)}$, (j = 1, 2,
..., i* - 1, i* + 1, ..., n; j ≠ i*), is thus obtained. As has been shown,
the squares of these quantities are the contributions to the multiple
$R^2$ that each of the remaining independent variables can make individ-
ually, the variable $X_{i^*}$ having been selected. The highest value of
$r^2_{c(j.i^*)}$, given by variable $X_{j^*}$, say, is taken. Hence


$$R^2_{c.i^*,j^*} = r^2_{ci^*} + r^2_{c(j^*.i^*)}. \tag{16}$$


The difference between (15) and (16) is tested by (vide section III),


$$F = \frac{r^2_{c(j^*.i^*)}}{(1 - R^2_{c.i^*,j^*})/(N - 2)}, \qquad [\text{d.f.: } n_1 = 1,\ n_2 = N - 2]. \tag{17}$$


If the F ratio is found to be significant, $X_{j^*}$ becomes the second vari-
able selected. Its column in the (n+1)X(n+1) correlation matrix is
used to compute the second column, $t_2$, of T.

The semi-partial correlations, $r_{c(k.i^*,j^*)}$, of the remaining n - 2
independent variables, (k = 1, 2, ..., i* - 1, i* + 1, ..., j* - 1, j* + 1,
..., n; k ≠ i*, j*), are next calculated. The highest value, that for
variable $X_{k^*}$, say, is taken, giving


$$R^2_{c.i^*,j^*,k^*} = r^2_{ci^*} + r^2_{c(j^*.i^*)} + r^2_{c(k^*.i^*,j^*)}. \tag{18}$$


The difference between (16) and (18) is tested, as before, and if the
increment due to $X_{k^*}$ is found significant, $X_{k^*}$ becomes the third vari-
able selected, $t_3$ is calculated; and so on.

When a stage is reached at which none of the remaining inde-
pendent variables makes a further significant contribution to the mul-
tiple $R^2$, the process terminates. It remains possible that if one or
more of the variables not so far selected exerts a suppressor effect,
then some pair or higher-order combination of the remaining vari-
ables might together add significantly to the multiple $R^2$. An advan-
tage of the method is that it permits a test to be made against this
contingency. Suppose that the selection process has been terminated
according to the criterion just given, when p variables have in fact
been selected which jointly yield a squared multiple correlation of
$R^2_{c(p)}$. The multiple correlation of the criterion variable, $X_c$, with
the whole battery has already been calculated (vide first paragraph
of this section). If it is shown that $R^2_{c(p)}$ does not differ significantly
from $R^2_{c.1,2,\dots,n}$ then it follows that none of the n - p remaining vari-
ables, neither singly nor jointly in any combination, can add any fur-
ther significant increment to $R^2_{c(p)}$. The test is:


$$F = \frac{(R^2_{c.1,2,\dots,n} - R^2_{c(p)})/(n - p)}{(1 - R^2_{c.1,2,\dots,n})/(N - n - 1)}, \tag{19}$$


with degrees of freedom, $n_1 = n - p$ and $n_2 = N - n - 1$. Should
the test indicate a significant difference, it would be necessary to
search for its source; otherwise it would be concluded that the p
variables selected were an optimum set of effective independent vari-
ables. The selection process should stop only when the multiple $R^2$
based on the selected variables does not differ significantly from the
$R^2$ based on the whole battery. In actual applications of the method
for test selection it has not so far been found necessary to select more
than four variables. For example, in a student selection study by
Himmelweit and Summerfield (9) out of a battery of twenty-one
tests only four were selected.
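The selection procedure of this section can be sketched in code. The Python fragment below is an illustration under several assumptions and is not the authors' computing routine: the function name `select_variables` is ours, the correlations are hypothetical, the criterion is placed last in the matrix, the denominator degrees of freedom are taken from the general form of equation (14), and a single threshold `f_crit` stands in for the tabled value of F, which in practice changes with the degrees of freedom. Collinear batteries, where a residual variance falls to zero, are not handled.

```python
def select_variables(R_aug, N, f_crit):
    """Stepwise selection on an (n+1) x (n+1) correlation matrix, criterion last.

    Returns the indices chosen, in order, and their squared multiple R.
    """
    n = len(R_aug) - 1
    c = n
    resid = [row[:] for row in R_aug]  # residual covariances after each step
    chosen, r2 = [], 0.0
    remaining = list(range(n))
    while remaining:
        # Squared semi-partial correlation of the criterion with each candidate.
        gains = {j: resid[j][c] ** 2 / resid[j][j] for j in remaining}
        best = max(gains, key=gains.get)
        new_r2 = r2 + gains[best]
        df2 = N - (len(chosen) + 1) - 1
        f = gains[best] / ((1.0 - new_r2) / df2)
        if f < f_crit:
            break  # no remaining variable adds a significant increment
        # Remove the linear effect of the chosen variable from all variables.
        pivot = resid[best][best]
        col = [resid[a][best] for a in range(n + 1)]
        for a in range(n + 1):
            for b in range(n + 1):
                resid[a][b] -= col[a] * col[b] / pivot
        chosen.append(best)
        remaining.remove(best)
        r2 = new_r2
    return chosen, r2

# Hypothetical battery of three tests plus the criterion (last row and column).
R_aug = [[1.0, 0.5, 0.3, 0.6],
         [0.5, 1.0, 0.4, 0.4],
         [0.3, 0.4, 1.0, 0.1],
         [0.6, 0.4, 0.1, 1.0]]
N = 100
chosen, r2 = select_variables(R_aug, N, f_crit=4.0)

# With a zero threshold every variable enters, giving R^2 for the whole battery.
all_vars, r2_full = select_variables(R_aug, N, f_crit=0.0)

# Final check in the form of equation (19): selected set against whole battery.
n, p = 3, len(chosen)
f_19 = ((r2_full - r2) / (n - p)) / ((1.0 - r2_full) / (N - n - 1))
```

In this made-up example only the first test is selected, and the final ratio of the form (19) is small, so the two remaining tests would not be judged to add jointly to the prediction.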


V. The merits and demerits of the square root method of test selec- 
tion. 


The selected tests will generally have the highest beta weights. 
The beta weight, roughly speaking, measures the contribution of a 
variable to the multiple correlation, independently of all other predic- 
tor variables. Therefore, those variables with large beta weights in 
the complete battery are most likely to be selected for the reduced 
battery. But variables whose beta weights do not differ significantly 
from zero may, when selected, make a significant contribution. Their 
near-zero regression coefficient may arise from the fact that some 
other variables measure very nearly the same thing. Fisher (5, p. 
422) cites a study where 


. .. when seven sea-level characteristics were employed in the prediction, 
not one of the coefficients was significant, although an apparently good 
prediction was obtained from the multiple regression formula. All that 
the non-significance meant, however, was that if any one of the coeffici- 
ents were given the value zero and the other coefficients adjusted, the 
prediction formula was not significantly impaired. The sea-level char- 
acteristics showed, in fact, sufficiently close mutual correlation for any 
one of them to be capable of replacement by an appropriate linear func-
tion of the others, so as to compensate nearly completely for its absence
from the prediction formula. 

The square root method of selection is less likely than the 
Wherry-Doolittle method to terminate with variables having non-sig- 
nificant beta weights. Of course, any method which selected the vari- 
ables making the highest contribution to the multiple correlation and 
which used the F ratio tests that we advocate would have the same
result. There is some possibility that one of the earlier variables se- 
lected will be highly correlated with later variables, but we suggest 
that this would be quite rare. The method would therefore seem to 
be a solution to Ragnar Frisch’s problem of “nonsense multiple corre- 
lations” (7). But even if the square root method were always to give 
significant beta weights, it does not follow that the “best” solution has 
been found. 

The “best” solution could be found if we were to proceed as fol- 
lows: 

(1) calculate $R^2_{(n)}$ based on all n variables;

(2) calculate $R^2_{(1)}$ for each single variable and test each of
these coefficients for significance against $R^2_{(n)}$;

(3) calculate $R^2_{(2)}$ for each of the $n(n-1)/2$ pairs of variables
and test each of the coefficients for significance against $R^2_{(n)}$;

(4) calculate $R^2_{(3)}$ for each of the $n(n-1)(n-2)/6$ triads of vari-
ables and test each coefficient for significance against $R^2_{(n)}$; and so on,
until a value of $R^2_{(q)}$ not significantly different from $R^2_{(n)}$ was ob-
tained. If, then, the null hypothesis were accepted for, let us say, a
particular group of q variables, then this group would necessarily be
the “best” set of variables to take. For all other sets of q or fewer
variables would have been tested and found wanting.

Now it should be clear that although the “best” set of variables, 
q, may involve less than the number of variables, say p, selected by
the square root method, p will never be less than q. This must be the
case since the square root method ensures that the $R^2_{(p)}$, based on the
p selected variables, can never differ significantly from $R^2_{(n)}$.
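The amount of labour that the exhaustive search would entail can be gauged by simple counting. The Python sketch below is an illustration only; the function names are hypothetical, and the figures of twenty-one tests and four selected variables are borrowed from the student selection study cited in section IV.

```python
from math import comb

def exhaustive_fits(n, q):
    """Number of subsets of size 1..q among n variables, one R^2 per subset."""
    return sum(comb(n, k) for k in range(1, q + 1))

def stepwise_fits(n, p):
    """Candidate increments examined while p variables are chosen one at a time."""
    return sum(n - k for k in range(p))

n_subsets = exhaustive_fits(21, 4)  # singles, pairs, triads and tetrads
n_steps = stepwise_fits(21, 4)      # 21 + 20 + 19 + 18
```

On these figures the exhaustive procedure would require 7546 multiple correlations, against 78 semi-partial correlations examined by the stepwise procedure in reaching four variables.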

The square root method of selection is a special case of what Dr.
M. P. Schützenberger has called a “one-step locally best” solution to
a maximisation demanding a finite sequential procedure. Dr. Schützenberger
(personal communication) has given a general theorem to
the effect that when the steps are independent of one another, the
“best” solution is the “one-step locally best” solution. In our case, he
says, this means that when all tests have zero intercorrelations, the
square root method gives the best possible solution, but it may also
give the best possible solution in other cases.

The selection method discussed here can be extended to such 
problems as item analysis and analysis of variance with dispropor- 
tionate frequencies, to reject superfluous explanatory variables. In 
connection with the analysis of variance problem, James Durbin (per- 
sonal communication) has pointed out that: 


. .. the procedure of selecting the explanatory variable which gives the 
largest value of F alters the significance level of the test. 

Suppose for simplicity that the null hypothesis is known to be true.
Then if the p values of F were independent of each other, the probabil-
ity that none would be significant at the 5% level would be


$$\left(1 - \frac{1}{20}\right)^p.$$


Thus the probability that at least one value appeared significant would be


$$1 - \left(1 - \frac{1}{20}\right)^p.$$


This is greater than 1/20 if p > 1. In fact the values of F would not
be independent but would be correlated, though not perfectly. Thus the
effective significance level used would be somewhere between 1/20 and


$$1 - \left(1 - \frac{1}{20}\right)^p.$$

In the non-null case the complications are likely to be even greater.


(See also Kendall on “Use of the z test for several variance ratios,” 
14, p. 199 ff.) 
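The size of the effect described in the quoted passage is easily tabulated. The Python sketch below evaluates the upper bound given there for independent F values; the function name `familywise_level` is hypothetical.

```python
def familywise_level(p, alpha=1.0 / 20.0):
    """Chance that at least one of p independent tests appears significant."""
    return 1.0 - (1.0 - alpha) ** p

level_1 = familywise_level(1)  # a single test keeps the nominal level
level_5 = familywise_level(5)  # five independent candidates
```

With five independent candidates the bound is already about 0.23, so the effective level may lie anywhere between 0.05 and 0.23 when the largest of five F values is selected.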

The foregoing criticism applies equally to the present case of 
multiple regression. Nevertheless, we suggest that the F ratio test
provides the best available statistical decision procedure. However, 
the tabled levels of significance of F evidently do not apply precisely. 
It is clear that the square root method tends to select rather too few 
tests whereas the Wherry-Doolittle leads to the selection of too many. 

Another defect of the square root method (and all other selection 
procedures) is that the method capitalises sampling errors. It is rec-
ommended that the population estimate of the squared multiple corre-
lation be calculated as the “shrunken” $R^2_{(n)}$, where all n variables
are used in calculating the shrinkage. This is, of course, the estimate
against which we have compared multiple correlations based on se-
lected variables.


EFFECT OF GROUP HETEROGENEITY ON
ITEM PARAMETERS* 


HAROLD GULLIKSEN 
EDUCATIONAL TESTING SERVICE 


Most indexes of item validity and difficulty vary systemati- 
cally with changes in the mean and variance of the group. For- 
mulas are presented showing how certain item parameters will vary 
with these alterations in group mean and variance. Item parameters 
are also suggested which should remain invariant under such 
changes. These parameters are developed under two different as- 
sumptions: first, the assumption that the total distribution of the 
item ability variable is normal, and, second, that the distribution of 
the item ability variable for each array of the explicit selection vari- 
able is normal. 


Most indexes of item validity or item difficulty vary systemati- 
cally with the ability level of the group and with the variance in the 
ability of the group. Thus, the “per cent of persons answering an 
item correctly” increases as the mean ability of the group increases 
and decreases as the mean ability of the group decreases. The “item- 
criterion correlation” increases and decreases as the variance of the 
group increases and decreases. Item parameters which did not vary 
systematically as the mean and variance of the group ability changed 
would be valuable. Lacking such parameters it would be valuable to 
have formulas indicating the amount of change in a given item para- 
meter to be expected for a given change in group mean and variance. 

This paper will present the derivation of formulas which show 
the amount of change in various item parameters which should occur 
because of changes in group heterogeneity and also will develop item 
parameters which do not systematically increase or decrease as the 
heterogeneity of the group changes. 

We will consider two groups of persons: an unselected group
with a specified mean and variance, and a selected group with a dif-
ferent mean (usually though not necessarily higher) and a different
variance (usually though not necessarily smaller). It should also be
noted that the theory to be developed applies regardless of the direc-
tion of the change in mean and variance. Thus, rather than utilizing
one set of symbols to designate the selected group and another set to
designate the unselected group we will say that lower-case letters will
be used to designate the group for which all the parameters are avail-
able; upper-case letters will be used to designate the group for which
some information is available and for which information is desired
on other parameters. That is to say, the unknowns to be solved for
will always be designated by upper-case letters. It will be noted dur-
ing the derivation that no assumption is made regarding the direction
of the change so that the formulas hold for utilizing information
from an unselected group to estimate parameters in a selected group
or for utilizing parameters from a selected group to estimate those
in an unselected group.

*The writer wishes to acknowledge helpful discussions of this paper with
Paul Horst and Herbert S. Sichel who have worked on various aspects of the
problem of invariant item parameters.


I. Definitions and Assumptions 


We will consider three major types of variables, as follows: 


X, x is used to designate the explicit selection variable,
that is, the variable which was directly used for se-
lection.


Y, y is used to represent any other variable on which se-
lection occurs only because of its correlation with
X (x); this variable may be termed the variable sub-
ject to incidental selection.


$I_G$, $i_g$ designates a gross score variable which represents
the ability required to answer item G (or g).
(G = 1, ..., K; g = 1, ..., k.) For example, for a test
of K (k) items there would be K (k) such variables,
a different one for each item.


It is also necessary to assume that for each item ability variable
there exists an ability level such that all persons above this level an-
swer the item correctly and all persons below this level answer the
item incorrectly.


$I'_G$, $i'_g$ will be used to designate the particular ability score
above which all persons answer item G (g) correct-
ly and below which all persons answer item G (g)
incorrectly.


For any given item it is assumed that

$$I'_G = i'_g, \quad (G = g). \tag{1}$$

It is also necessary to designate means, standard deviations, cor-
relations, and proportions for the different variables.


M, m (with appropriate subscripts: X, x, Y, y, G, g)
designates the means of the various groups.


S, s (again with appropriate subscripts: X, x, Y, y,
G, g) designates the standard deviations.


R, r (with appropriate subscripts) designates correla-
tions.


$P_G$, $p_g$ designates the proportion of persons answering the
item correctly.


$Z_G$, $z_g$ designates standard scores for the variables $I_G$ and
$i_g$ respectively.


That is to say,

$$Z_G = \frac{I_G - M_G}{S_G}, \qquad z_g = \frac{i_g - m_g}{s_g}.$$

$Z'_G$, $z'_g$ are standard score levels corresponding to $P_G$ and
$p_g$ and to $I'_G$ and $i'_g$; that is to say

$$Z'_G = \frac{I'_G - M_G}{S_G}, \qquad z'_g = \frac{i'_g - m_g}{s_g}. \tag{2}$$

From equations (1) and (2) we see that

$$Z'_G S_G + M_G = z'_g s_g + m_g. \tag{3}$$


In developing the theory for the influence of group selection on 
item parameters we will utilize the usual assumptions for the influ- 
ence of selection on correlation (references 1, 2, 3, 4, 5, and 7). First, 
assume that the regression line of $I_G$ on X is identical with the regres-
sion of $i_g$ on x. To have identical regression lines we need to assume
that both the slope and intercept of the regression line are the same
for the two groups, e.g.,


$$R_{XG}\frac{S_G}{S_X} = r_{xg}\frac{s_g}{s_x} \quad \text{and} \tag{4}$$


$$M_G - R_{XG}\frac{S_G}{S_X} M_X = m_g - r_{xg}\frac{s_g}{s_x} m_x. \tag{5}$$

It is also necessary to assume that the variance about each of these
regression lines is the same for both groups. Thus, we have

$$S_G^2 - S_G^2 R_{XG}^2 = s_g^2 - s_g^2 r_{xg}^2. \tag{6}$$


Correspondingly, we assume that the regression of Y on X is the
same as the regression of y on x. Thus, we have


$$R_{YX}\frac{S_Y}{S_X} = r_{yx}\frac{s_y}{s_x} \quad \text{and} \tag{7}$$

$$M_Y - R_{YX}\frac{S_Y}{S_X} M_X = m_y - r_{yx}\frac{s_y}{s_x} m_x. \tag{8}$$


It is also assumed that the variance about each of these regression
lines is the same. Thus, we have


$$S_Y^2 - S_Y^2 R_{YX}^2 = s_y^2 - s_y^2 r_{yx}^2. \tag{9}$$


In addition to the foregoing six assumptions it is also necessary to
assume that selection is such that the correlation between the item
ability variable and the incidental selection variable with the explicit
selection variable partialled out is the same for both groups. That
is to say, it is assumed that $R_{YG.X} = r_{yg.x}$. Using the formula for par-
tial correlation, we have

$$\frac{R_{YG} - R_{XG}R_{YX}}{\sqrt{1 - R_{XG}^2}\,\sqrt{1 - R_{YX}^2}} = \frac{r_{yg} - r_{xg}r_{yx}}{\sqrt{1 - r_{xg}^2}\,\sqrt{1 - r_{yx}^2}}. \tag{10}$$














II. Effect of Selection on Mean and Variance 


We will first show the general formulas for the effect of selection
on mean and variance. From equations (4) and (6) we see that


$$\frac{S_G^2}{s_g^2} - 1 = r_{xg}^2\left(\frac{S_X^2}{s_x^2} - 1\right). \tag{11}$$


The relation between the variance of the incidental and explicit se-
lection variables can be obtained from equations (7) and (9) giving


$$\frac{S_Y^2}{s_y^2} - 1 = r_{yx}^2\left(\frac{S_X^2}{s_x^2} - 1\right). \tag{12}$$


In some cases when dealing with items the correlation $r_{yx}$ will vary
for each different selection of cases on each item while it could con-
ceivably be true that $R_{YX}$ would be the same for all items since we
are thinking of an unselected total group. In this case we may utilize
formulas (7) and (9) to write


$$\frac{s_y^2}{S_Y^2} - 1 = R_{YX}^2\left(\frac{s_x^2}{S_X^2} - 1\right). \tag{13}$$


Either equation (12) or (13) shows the change in variance of the in-
cidental and explicit selection variables as a function either of $r_{yx}$ or
$R_{YX}$. The formulas showing change in means are also needed. From
equations (4) and (5) we have


$$m_g - M_G = r_{xg}\frac{s_g}{s_x}(m_x - M_X). \tag{14}$$


From equations (7) and (8) we have


$$m_y - M_Y = r_{yx}\frac{s_y}{s_x}(m_x - M_X), \quad \text{and} \tag{15}$$

$$m_y - M_Y = R_{YX}\frac{S_Y}{S_X}(m_x - M_X). \tag{16}$$


Equations (15) and (16) show the change in mean of the explicit
and incidental selection variables as a function of $r_{yx}$ or $R_{YX}$, which-
ever is the more convenient.
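Equations (12) and (15), solved for the upper-case unknowns, can be illustrated numerically. The Python sketch below uses hypothetical function names and made-up values; the lower-case group is taken to be a selected group with mean 60 and standard deviation 8 on x, and the upper-case group an unselected one with mean 50 and standard deviation 10 on X.

```python
def incidental_variance(s_y2, r_yx, S_x2, s_x2):
    """Equation (12) solved for the variance S_Y^2 in the other group."""
    return s_y2 * (1.0 + r_yx ** 2 * (S_x2 / s_x2 - 1.0))

def incidental_mean(m_y, r_yx, s_y, s_x, m_x, M_x):
    """Equation (15) solved for the mean M_Y in the other group."""
    return m_y - r_yx * (s_y / s_x) * (m_x - M_x)

# Hypothetical known (lower-case) group values for the incidental variable y.
m_y, s_y, r_yx = 55.0, 6.0, 0.5
m_x, s_x = 60.0, 8.0
M_x, S_x = 50.0, 10.0

S_y2 = incidental_variance(s_y ** 2, r_yx, S_x ** 2, s_x ** 2)
M_y = incidental_mean(m_y, r_yx, s_y, s_x, m_x, M_x)
```

The incidental variable is estimated to have variance 41.0625 and mean 51.25 in the unselected group, that is, a larger spread and a lower mean than in the selected group, as one would expect.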

In the subsequent derivations we will assume that at least three
of the four means ($m_y$, $M_Y$, $m_x$, $M_X$) are known. If any three are
known, equations (15) or (16) may be utilized to obtain the other one.
If all four are known, equations (15) and (16) may be utilized to check
on the agreement of the data with the basic assumptions indicated by
equations (7) and (8). Correspondingly, we assume that three out
of the four standard deviations ($s_y$, $S_Y$, $s_x$, $S_X$) are known. If any
three are known, equations (12) or (13) may be utilized to obtain the
other one. If all four are known, equations (12) and (13) may be
utilized to check the reasonableness of assumptions (7) and (9). It
is also necessary to know either $r_{yx}$ or $R_{YX}$. In addition it will be
found that $r_{xg}$, $r_{yg}$, and $z'_g$ must be known in order to utilize the for-
mulas that are to follow.

Given the information indicated above we now turn to the prob- 
lem of estimating the effect of selection on item difficulty and item 
validity indexes. 

So far, the writer has not found it possible to complete the solu-
tion without some assumption regarding the distribution of the item
ability variable ($I_G$, $i_g$). Two different assumptions regarding the
distribution of this variable are presented. Section III develops a set
of formulas under the assumption that variable $i_g$ may be assumed to
be normally distributed. Thus it becomes reasonable to estimate $r_{xg}$
and $r_{yg}$ by means of biserial correlation coefficients. Section IV utilizes
the assumption as suggested by Gillman and Goode (3) that variable
$i_g$ is normally distributed about the regression of $i_g$ on x for each of
the x-arrays. The first assumption is that the marginal totals of $i_g$
are normally distributed. The second assumption is that the devia-
tions from the regression line for each array are normally distributed.
It may be that there are some conditions under which one assumption 
is superior and some conditions under which the other assumption is 
superior. No studies have been made on this topic, so far as the writer 
is aware. 


III. A Normal Distribution Assumed for Item Ability ($i_g$)
(a) Effect of Selection on Item Difficulty and Correlation.


If it is assumed that the distribution of $i_g$ is nearly normal, we
may obtain numerical values for $r_{xg}$ and $r_{yg}$ by use of the formula for
the biserial correlation coefficient. It should be noted that the use of bi-
serial correlation assumes only a normal distribution of $i_g$. No re-
strictions are implied by the biserial correlation coefficient on the
variance of $i_g$. It should be noted also that $p_g$ is known from the data
and since $i_g$ is assumed to be normally distributed, $z'_g$ can be found
as a function of $p_g$ from tables of the normal curve. Thus we have,
given by the data, numerical values for $r_{xg}$, $r_{yg}$, and $z'_g$ provided we
are willing to assume a normal distribution of $i_g$.

With this information, values of $R_{XG}$, $Z'_G$, and $R_{YG}$ can be ob-
tained by the following procedures. From equations (4) and (11) we
have

$$R_{XG} = \frac{r_{xg} S_X}{\sqrt{r_{xg}^2 S_X^2 + s_x^2 (1 - r_{xg}^2)}}. \tag{17}$$


From equations (3), (11), and (14) we find


$$Z'_G = \frac{s_x z'_g + (m_x - M_X)\, r_{xg}}{\sqrt{r_{xg}^2 S_X^2 + s_x^2 (1 - r_{xg}^2)}}. \tag{18}$$


This equation gives $Z'_G$ as a function of the means and standard
deviations of the explicit selection variable ($m_x$, $M_X$, $s_x$, $S_X$). It may be
noted that $Z'_G$ may also be expressed as a function of the means and
standard deviations of the incidental selection variable ($m_y$, $M_Y$, $s_y$,
$S_Y$) as well as the other variables ($z'_g$, $r_{yx}$, $r_{xg}$) by utilizing (12) and
(15) in equation (18).

It will be noted that $Z'_G$ is a standard score as defined by equation
(2) representing the ability level on variable $I_G$ required for
answering item G. If one assumes a normal distribution for the variable
$I_G$ then $P_G$ (the proportion of persons in the new group who
would be expected to answer the item correctly) may be computed. In
particular, if a normal distribution of $I_G$ is assumed then tables of the
normal curve may be utilized to obtain $P_G$ from $Z'_G$.
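As a rough numerical illustration of equations (17) and (18), the sketch below converts the item correlation and critical level observed in one group into the values expected in a second group with a different mean and standard deviation on the explicit selection variable. The function name, argument names, and numerical inputs are hypothetical; the normal tables are again replaced by `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def item_params_in_new_group(r_xg, z_g, m_x, s_x, M_X, S_X):
    """Apply equations (17) and (18), then read P_G from the normal curve.

    Lower-case arguments describe the group actually tested; M_X and S_X
    describe the group for which the item parameters are wanted.
    """
    denom = sqrt(r_xg**2 * S_X**2 + s_x**2 * (1.0 - r_xg**2))
    R_XG = r_xg * S_X / denom                        # equation (17)
    Z_G = (s_x * z_g + (m_x - M_X) * r_xg) / denom   # equation (18)
    P_G = 1.0 - NormalDist().cdf(Z_G)                # proportion above the critical level
    return R_XG, Z_G, P_G


# Hypothetical item: r_xg = .50 and z'_g = .30 in a group with mean 50, S.D. 10,
# carried over to a more variable and more able group (mean 55, S.D. 15).
R_XG, Z_G, P_G = item_params_in_new_group(0.50, 0.30, 50.0, 10.0, 55.0, 15.0)
print(round(R_XG, 3), round(Z_G, 3), round(P_G, 3))
```

When the two groups have the same mean and standard deviation on x, the function returns the original correlation and critical level unchanged, which is a convenient check on the arithmetic.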

By substituting values from equations (4), (6), (7), (9), (11),
and (12) in (10) we find

$$R_{YG} = \frac{r_{yg}\, r_{xy} + r_{xg}\left(\dfrac{S_Y^2}{s_y^2} - 1\right)}{\dfrac{S_Y}{s_y}\sqrt{r_{xy}^2 + r_{xg}^2\left(\dfrac{S_Y^2}{s_y^2} - 1\right)}} . \tag{19}$$

It should be noted that (19) gives $R_{YG}$ as a function of $r_{xy}$. If it is
more convenient to utilize the value $R_{YX}$ we obtain the appropriate
equation by substituting (9) in (19) obtaining

$$R_{YG} = \frac{r_{xg}\left(\dfrac{S_Y^2}{s_y^2} - 1\right) + r_{yg}\sqrt{\dfrac{S_Y^2}{s_y^2}\,(R_{YX}^2 - 1) + 1}}{\dfrac{S_Y}{s_y}\sqrt{r_{xg}^2\left(\dfrac{S_Y^2}{s_y^2} - 1\right) + \dfrac{S_Y^2}{s_y^2}\,(R_{YX}^2 - 1) + 1}} . \tag{20}$$

Equations (19) and (20) require the value of both variances for the
incidental selection variable, e.g., $S_Y$ ($s_y$).

If it is more convenient to obtain $R_{YG}$ in terms of the variances
of the explicit selection variable ($S_X$, $s_x$) then the proper formula
may be found by substituting (12) in (19) obtaining

$$R_{YG} = \frac{r_{yg} + r_{xy}\, r_{xg}\left(\dfrac{S_X^2}{s_x^2} - 1\right)}{\sqrt{\left[1 + r_{xy}^2\left(\dfrac{S_X^2}{s_x^2} - 1\right)\right]\left[1 + r_{xg}^2\left(\dfrac{S_X^2}{s_x^2} - 1\right)\right]}} , \tag{21}$$

which gives $R_{YG}$ in terms of $r_{xy}$ and the variances $S_X^2$ and $s_x^2$. We
may substitute equation (13) in (20) and simplify obtaining

$$R_{YG} = \frac{r_{xg} R_{YX} (S_X^2 - s_x^2) + r_{yg}\, s_x \sqrt{S_X^2 - R_{YX}^2 (S_X^2 - s_x^2)}}{S_X \sqrt{r_{xg}^2 (S_X^2 - s_x^2) + s_x^2}} , \tag{22}$$

which gives $R_{YG}$ in terms of $R_{YX}$ and the variances $S_X^2$ and $s_x^2$.

Equations (17) to (22) inclusive show the changes to be expected
in the item difficulty parameter and item correlation dependent
on various values of mean and standard deviation of the group.
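Since equations (19), (21), and (22) are stated to be alternative forms of one relation, a quick numerical cross-check may be useful. The sketch below evaluates the three forms for one hypothetical set of correlations. It assumes the customary selection relations for the incidental variable, namely $S_Y^2/s_y^2 = 1 + r_{xy}^2 (S_X^2/s_x^2 - 1)$ and the form of equation (17) applied to y, because the equations the text refers to as (9), (12), and (13) are not reproduced in this section. All function names are hypothetical.

```python
from math import sqrt, isclose


def r_yg_eq19(r_yg, r_xy, r_xg, s_y, S_Y):
    """Equation (19): uses the two S.D.'s of the incidental variable y."""
    c = S_Y**2 / s_y**2 - 1.0
    return (r_yg * r_xy + r_xg * c) / ((S_Y / s_y) * sqrt(r_xy**2 + r_xg**2 * c))


def r_yg_eq21(r_yg, r_xy, r_xg, s_x, S_X):
    """Equation (21): uses the two S.D.'s of the explicit variable x."""
    k = S_X**2 / s_x**2 - 1.0
    return (r_yg + r_xy * r_xg * k) / sqrt((1.0 + r_xy**2 * k) * (1.0 + r_xg**2 * k))


def r_yg_eq22(r_yg, R_YX, r_xg, s_x, S_X):
    """Equation (22): uses R_YX from the new group instead of r_xy."""
    d = S_X**2 - s_x**2
    num = r_xg * R_YX * d + r_yg * s_x * sqrt(S_X**2 - R_YX**2 * d)
    return num / (S_X * sqrt(r_xg**2 * d + s_x**2))


def sd_after_selection(r_xy, s_y, s_x, S_X):
    """Assumed relation: S.D. of y in the new group (not quoted from the paper)."""
    return s_y * sqrt(1.0 + r_xy**2 * (S_X**2 / s_x**2 - 1.0))


def r_after_selection(r, s_x, S_X):
    """Assumed relation: the form of equation (17), applied here to r_xy."""
    return r * S_X / sqrt(r**2 * S_X**2 + s_x**2 * (1.0 - r**2))


# Hypothetical values for the group actually tested.
r_yg, r_xy, r_xg, s_x, S_X, s_y = 0.40, 0.60, 0.50, 1.0, 1.5, 2.0
S_Y = sd_after_selection(r_xy, s_y, s_x, S_X)
R_YX = r_after_selection(r_xy, s_x, S_X)

via_19 = r_yg_eq19(r_yg, r_xy, r_xg, s_y, S_Y)
via_21 = r_yg_eq21(r_yg, r_xy, r_xg, s_x, S_X)
via_22 = r_yg_eq22(r_yg, R_YX, r_xg, s_x, S_X)
print(round(via_19, 4), round(via_21, 4), round(via_22, 4))
```

Agreement of the three printed values is only a consistency check on the algebra under the stated assumptions; it says nothing about whether the selection assumptions themselves hold for real data.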


(b) Item Indexes Which Do Not Vary Systematically with Changes 
in Group Ability. 

It is possible to devise item indexes that would not be expected to
change systematically with group selection by the following procedure.
Divide equation (4) by the square root of equation (6) and
designate the ratio by $A''_{xg}$. This gives

$$A''_{xg} = \frac{r_{xg}}{s_x \sqrt{1 - r_{xg}^2}} = \frac{R_{XG}}{S_X \sqrt{1 - R_{XG}^2}} . \tag{23}$$

It will be noted that both equations (4) and (6) are assumed invariant
with group selection. The unknown values $s_g$ and $S_G$ disappeared
on dividing, so that equation (23) gives an invariant function of the
standard deviation and correlation. The item index $A''_{xg}$ may be
converted into $R_{XG}$ for any arbitrary standard deviation $S_X$ by equation

$$R_{XG} = \frac{A''_{xg} S_X}{\sqrt{1 + (A''_{xg})^2 S_X^2}} . \tag{24}$$

Thus we see that if the item ability may be assumed to be nearly
normally distributed $A''_{xg}$ may be an invariant index from which the
correlation $R_{XG}$ could be obtained for groups with any specified
variability ($S_X$).

With respect to item difficulty we may use $X''_g$ to designate the
ability level for variable X such that half the persons pass and half
fail item G. This means that we must find the point where the
regression of $I_G$ on X crosses the critical ability level $Z'_G$. From
equations (8), (4), and (5) we have

$$X''_g = \frac{z'_g\, s_x}{r_{xg}} + m_x = \frac{Z'_G\, S_X}{R_{XG}} + M_X . \tag{25}$$

This equation gives an invariant relationship that should represent
the ability level at which half the persons will pass and half fail item
G.
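The claimed invariance of the two indexes can be checked numerically. The sketch below computes $A''_{xg}$ and $X''_g$ from equations (23) and (25) in one group, carries the item parameters to a second group by equations (17) and (18), and recomputes both indexes there. The function names and numerical values are hypothetical.

```python
from math import sqrt, isclose


def carry_to_new_group(r_xg, z_g, m_x, s_x, M_X, S_X):
    """Item correlation and critical level in the new group, per (17) and (18)."""
    denom = sqrt(r_xg**2 * S_X**2 + s_x**2 * (1.0 - r_xg**2))
    return r_xg * S_X / denom, (s_x * z_g + (m_x - M_X) * r_xg) / denom


def a_index(r, s):
    """Equation (23): discrimination index from a correlation and an S.D. of x."""
    return r / (s * sqrt(1.0 - r * r))


def r_from_a(a, S_X):
    """Equation (24): recover the correlation for a group with S.D. S_X."""
    return a * S_X / sqrt(1.0 + a * a * S_X * S_X)


def x_index(z, s, r, m):
    """Equation (25): level of x at which half the persons pass the item."""
    return z * s / r + m


# Hypothetical item in the tested group, and a second group differing in mean and S.D.
r_xg, z_g, m_x, s_x = 0.50, 0.30, 50.0, 10.0
M_X, S_X = 55.0, 15.0
R_XG, Z_G = carry_to_new_group(r_xg, z_g, m_x, s_x, M_X, S_X)

print(a_index(r_xg, s_x), a_index(R_XG, S_X))                   # two values of A''
print(x_index(z_g, s_x, r_xg, m_x), x_index(Z_G, S_X, R_XG, M_X))  # two values of X''
```

Both pairs agree to rounding error, as equations (23) and (25) require when the selection assumptions hold exactly.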

As yet no useful invariant relationship between two incidental
selection variables has been devised. Some invariant function of $r_{yg}$
and $R_{YG}$ would be rather interesting to obtain. For example, if
equation (10) is multiplied by the square root of equations (6) and (9),
divided by equation (4), and multiplied by equation (7), we see that

$$s_y^2\, r_{yx}^2 \left(\frac{r_{yg}}{r_{yx}\, r_{xg}} - 1\right) = S_Y^2\, R_{YX}^2 \left(\frac{R_{YG}}{R_{YX}\, R_{XG}} - 1\right) . \tag{26}$$

However, three different correlation coefficients are involved in this
relationship. An index such as that indicated by equation (26) does
not seem to be particularly useful at present.


IV. Normal Distribution of Item Ability for Each X-Array 


Instead of assuming a normal distribution for the marginal totals
on variable $I_G$ we may assume that the distribution of $I_G$ is normal
(or nearly normal) for each of the X-arrays. This assumption has
been presented by Gillman and Goode (3) and the index suggested
by them utilized by Sichel (6). If we say that there are h different
arrays of X we may use the subscript f ($f = 1 \cdots h$) to designate the
array and say that $I_{fg}$ is assumed to be normally distributed for each
value of f and g.

Let $x_f$ be the mid-point of the fth array of x,

$p_{fg}$ be the per cent of persons in this fth array who pass item g
(i.e., the per cent above ability level $i'_g$ for item g),

$u_{fg}$ be defined as a base line deviation on a normal curve
corresponding to the area $p_{fg}$. Thus for each array f, $u_{fg}$ is given
as a function of the known value $p_{fg}$.

We may also adopt the convention that

if $p_{fg} > 50$, then $u_{fg} > 0$;
if $p_{fg} < 50$, then $u_{fg} < 0$.

Thus, for any given item g we have pairs of values $u_{fg}$ and $x_f$ ($f = 1
\cdots h$). A plot of these values ($u$ vs. $x$) can be made for each item.
This method of analysis has the advantage that if the assumptions
involved are correct the plot will be approximately linear or,
conversely, we may say if the plot of $u_{fg}$ against $x_f$ ($f = 1 \cdots h$) for a
given item g is markedly nonlinear it shows that the assumptions
made do not hold for that particular item. One thus has a check on
the reasonableness of the basic assumptions. If the variables $x$ and
$i_g$ (or its transform $u_g$) are linearly related then we have equation

$$u_{fg} = A_{xg}\, x_f - B_{xg} . \tag{27}$$

The values of $A_{xg}$ and $B_{xg}$ may be determined from the graph by
some appropriate curve-fitting method. One may use any method from
rough graphic approximations to a least-squares method. However,
given the plot of $u_{fg}$ against $x_f$ for a given item, numerical values for
$A_{xg}$ and $B_{xg}$ can be determined.
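The curve-fitting step can be sketched as follows. The example converts the per cent passing in each array to a normal deviate and fits the line of equation (27) by ordinary unweighted least squares, which is only one of the methods the text allows and is an assumption of this sketch. The function name and the array data are hypothetical; the per cents must lie strictly between 0 and 100 for the normal deviate to exist.

```python
from statistics import NormalDist


def fit_item_line(x_mid, pct_pass):
    """Fit u_fg = A * x_f - B by unweighted least squares.

    x_mid    : mid-points x_f of the arrays of x
    pct_pass : per cent passing the item in each array (strictly 0 < p < 100)
    Returns (A, B, X), where X = B / A is the level of x at which u = 0.
    """
    nd = NormalDist()
    # Convention of the text: p > 50 gives u > 0, p < 50 gives u < 0.
    u = [nd.inv_cdf(p / 100.0) for p in pct_pass]
    n = len(x_mid)
    mean_x = sum(x_mid) / n
    mean_u = sum(u) / n
    sxx = sum((x - mean_x) ** 2 for x in x_mid)
    sxu = sum((x - mean_x) * (v - mean_u) for x, v in zip(x_mid, u))
    A = sxu / sxx
    B = A * mean_x - mean_u  # intercept written with the minus sign of (27)
    return A, B, B / A


# Hypothetical arrays: per cents generated from A = 0.1, B = 5 for illustration.
mid_points = [30.0, 40.0, 50.0, 60.0, 70.0]
per_cents = [100.0 * NormalDist().cdf(0.1 * x - 5.0) for x in mid_points]
A_xg, B_xg, X_G = fit_item_line(mid_points, per_cents)
print(round(A_xg, 4), round(B_xg, 4), round(X_G, 2))
```

With real data the plotted points would scatter about the line, and a marked curvature in the plot would be the sign, noted above, that the assumptions do not hold for the item.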

To interpret $A_{xg}$ and $B_{xg}$, we note from the regression of $i_g$ on $x$
that the point on the regression line (which may be designated
$\hat{i}_g$) is given by the expression $\hat{i}_g = i'_g + u_{fg}\, s_g \sqrt{1 - r_{xg}^2}$. Since
this is the point on the regression line corresponding to any particular
value such as $x_f$ we may write

$$\hat{i}_g = i'_g + u_{fg}\, s_g \sqrt{1 - r_{xg}^2} = r_{xg}\, \frac{s_g}{s_x}\, (x_f - m_x) + m_g . \tag{28}$$

If equation (28) is solved explicitly for $u_{fg}$ as a function of $x_f$
we find the following values for the constants $A_{xg}$ and $B_{xg}$ of equation
(27)

$$A_{xg} = \frac{r_{xg}\, s_g}{s_x\, s_g \sqrt{1 - r_{xg}^2}} = \frac{r_{xg}}{\sqrt{1 - r_{xg}^2}} \left(\frac{1}{s_x}\right) \tag{29}$$

and

$$B_{xg} = \frac{r_{xg}\, m_x}{s_x \sqrt{1 - r_{xg}^2}} + \frac{i'_g - m_g}{s_g \sqrt{1 - r_{xg}^2}} . \tag{30}$$

Equations (29) and (30) thus give the interpretation of the empirically
obtained constants $A_{xg}$ and $B_{xg}$. It will be noted that $A_{xg}$ as
interpreted by equation (29) is identical with $A''_{xg}$, as defined in
equation (23), and should have identical properties as an invariant item
index.


We now turn to the problem of a suitable item difficulty index. At
the point where $u_{fg} = 0$ we may specify that $x_f$ is indicated by $X_G$.
This point is the one where half the persons pass and half fail the
item. If we set $u_{fg} = 0$ and $x_f = X_G$ in equation (27) and utilize
equations (29) and (30) to solve for $X_G$ we find

$$X_G = \frac{B_{xg}}{A_{xg}} = m_x + \left(\frac{i'_g - m_g}{s_g}\right) \frac{s_x}{r_{xg}} . \tag{31}$$

Utilizing (2), (31) may be rewritten as

$$X_G = m_x + z'_g \left(\frac{s_x}{r_{xg}}\right) . \tag{32}$$

It may be noted that equation (32) is identical with equation (25).
That is, the index $X_G$ given by equation (31) is identical with the
index $X''_g$ given in equation (25).

Thus we see that if the assumptions used in the derivation are
satisfied, $A_{xg}$ (or $A''_{xg}$) and $X_G$ (or $X''_g$) are item indexes which should
be invariant with respect to changes in the standard deviation of the
variable x. Equations (23) and (25) give values for these indexes
derived from the assumption that the marginal totals of $i_g$ for persons
answering the item may be regarded as normally distributed.
Equations (27), (29), and (31) give values for these indexes based
on the assumption that the distribution of $i_g$ for each array of x may
be regarded as normal. An investigation of the constancy of these
indexes as arrived at by the two different sets of formulas would
indicate which assumption was the better.

It should be noted that by utilizing the approach indicated by
Gillman and Goode (3) it would also be possible to obtain the
correlation between y and $i_g$ for each array of variable x. This
correlation would correspond to the partial correlation between y and $i_g$
holding x constant. One would thus empirically verify the assumption
of equal partial correlation which is utilized in equation (10).
One might then use some invariant index such as that given in
equation (26) to obtain $R_{YG}$ for any distribution for which $S_Y$, $s_y$, and $R_{YX}$
were known. The value of $R_{XG}$ for use in equation (26) could be
obtained by the use of equation (24).


V. Summary 


In dealing with the problem of item indexes and group variabil- 
ity two different assumptions have been suggested. The assumption 
that the item ability variable is normally distributed gives equations 
(17) to (22) showing changes in item parameters with changes in 
group mean and variance. Under this assumption equations (23) 
and (25) give item indexes which should be relatively invariant with 
respect to changes in group mean and standard deviation. The assumption
that there is a normal distribution for each x-array as utilized
by Gillman and Goode has also been suggested. By means of a
curve-fitting procedure we may find values for the item parameters 
indicated in equations (27), (29), and (31). These parameters should 
be invariant with respect to changes in group mean and variance. An 
investigation of the conditions under which either of the sets of in- 
dices proposed would be relatively invariant with selection is still to 
be made.


COEFFICIENT ALPHA AND THE INTERNAL
STRUCTURE OF TESTS*

A general formula (α) of which a special case is the Kuder-
Richardson coefficient of equivalence is shown to be the mean of all
split-half coefficients resulting from different splittings of a test. α
is therefore an estimate of the correlation between two random samples
of items from a universe of items like those in the test. α is
found to be an appropriate index of equivalence and, except for very
short tests, of the first-factor concentration in the test. Tests divisible
into distinct subtests should be so divided before using the
formula. The index $\bar{r}_{ij}$, derived from α, is shown to be an index of
inter-item homogeneity. Comparison is made to the Guttman and
Loevinger approaches. Parallel split coefficients are shown to be
unnecessary for tests of common types. In designing tests, maximum
interpretability of scores is obtained by increasing the first-factor
concentration in any separately-scored subtest and avoiding substantial
group-factor clusters within a subtest. Scalability is not a
requisite.


I. Historical Resumé 

Any research based on measurement must be concerned with the 
accuracy or dependability or, as we usually call it, reliability of meas- 
urement. A reliability coefficient demonstrates whether the test de- 
signer was correct in expecting a certain collection of items to yield 
interpretable statements about individual differences (25). 

Even those investigators who regard reliability as a pale shadow 
of the more vital matter of validity cannot avoid considering the re- 
liability of their measures. No validity coefficient and no factor analy- 
sis can be interpreted without some appropriate estimate of the mag- 
nitude of the error of measurement. The preferred way to find out 
how accurate one’s measures are is to make two independent measure- 
ments and compare them. In practice, psychologists and educators 
have often not had the opportunity to recapture their subjects for a 
second test. Clinical tests, or those used for vocational guidance, are
generally worked into a crowded schedule, and there is always a desire
to give additional tests if any extra time becomes available.
Purely scientific investigations fare little better. It is hard enough
to schedule twenty tests for a factorial study, let alone scheduling
another twenty just to determine reliability.

*The assistance of Dora Damrin and Willard Warrington is gratefully
acknowledged. Miss Damrin took major responsibility for the empirical studies
reported. This research was supported by the Bureau of Research and Service,
College of Education.


This difficulty was first circumvented by the invention of the split- 
half approach, whereby the test is rescored, half the items at a time, 
to get two estimates. The Spearman-Brown formula is then applied 
to get a coefficient similar to the correlation between two forms. The 
split-half Spearman-Brown procedure has been a standard method of 
test analysis for forty years. Alternative formulas have been devel- 
oped, some of which have advantages over the original. In the course 
of our development, we shall review those formulas and show rela- 
tions between them. 


The conventional split-half approach has been repeatedly criti- 
cized. One line of criticism has been that split-half coefficients do not 
give the same information as the correlation between two forms given 
at different times. This difficulty is purely semantic (9, 14); the two
coefficients are measures of different qualities and should not be iden- 
tified by the same unqualified appellation “reliability.” A retest after 
an interval, using the identical test, indicates how stable scores are 
and therefore can be called a coefficient of stability. The correlation 
between two forms given virtually at the same time, is a coefficient 
of equivalence, showing how nearly two measures of the same general 
trait agree. Then the coefficient using comparable forms with an in- 
terval between testings is a coefficient of equivalence and stability. 
This paper will concentrate on coefficients of equivalence. 


The split-half approach was criticized, first by Brownell (3),
later by Kuder and Richardson (26), because of its lack of unique- 
ness. Instead of giving a single coefficient for the test, the procedure 
gives different coefficients depending on which items are grouped 
when the test is split in two parts. If one split may give a higher co- 
efficient than another, one can have little faith in whatever result is 
obtained from a single split. This criticism is with equal justice ap- 
plicable to any equivalent-forms coefficient. Such a coefficient is a 
property of a pair of tests, not a single test. Where four forms of a 
test have been prepared and intercorrelated, six values are obtained, 
and no one of these is the unique coefficient for Form A; rather, each 
is the coefficient showing the equivalence of one form to another spe- 
cific form. 


Kuder and Richardson derive a series of coefficients using data
from a single trial, each of them being an approximation to the
inter-form coefficient of equivalence. Of the several formulas, one has been
justifiably preferred by test workers. In this paper we shall be especially
concerned with this, their formula (20):

$$r_{tt(KR20)} = \frac{n}{n-1}\left(1 - \frac{\sum_i p_i q_i}{\sigma_t^2}\right); \quad (i = 1, 2, \cdots n). \tag{1}$$

Here, $i$ represents an item, $p_i$ the proportion receiving a score of 1,
and $q_i$ the proportion receiving a score of zero on the item.
We can write the more general formula

$$\alpha = \frac{n}{n-1}\left(1 - \frac{\sum_i V_i}{V_t}\right). \tag{2}$$

Here $V_t$ is the variance of test scores, and $V_i$ is the variance of item
scores after weighting. This formula reduces to (1) when all items 
are scored 1 or zero. The variants reported by Dressel (10) for cer- 
tain weighted scorings, such as Rights-minus-Wrongs, are also spe- 
cial cases of (2), but for most data computation directly from (2) is 
simpler than by Dressel’s method. Hoyt’s derivation (20) arrives 
at a formula identical to (2), although he draws attention to its ap- 
plication only to the case where items are scored 1 or 0. Following 
the pattern of any of the other published derivations of (1) (19, 22), 
making the same assumptions but imposing no limit on the scoring 
pattern, will permit one to derive (2). 
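Formula (2) and its reduction to formula (1) for items scored 1 or 0 can be illustrated with a minimal sketch. It assumes that variances are computed with divisor N (population form), so that the variance of a 0/1 item is exactly $p_i q_i$; the function names and the small score matrix are hypothetical.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Formula (2): rows are persons, columns are (possibly weighted) item scores."""
    n_items = len(scores[0])
    totals = [sum(person) for person in scores]
    item_vars = [pvariance([person[i] for person in scores]) for i in range(n_items)]
    return n_items / (n_items - 1) * (1.0 - sum(item_vars) / pvariance(totals))


def kr20(scores):
    """Formula (1): valid only when every item is scored 1 or 0."""
    n_items = len(scores[0])
    n_people = len(scores)
    totals = [sum(person) for person in scores]
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in scores) / n_people
        sum_pq += p * (1.0 - p)
    return n_items / (n_items - 1) * (1.0 - sum_pq / pvariance(totals))


# Hypothetical 0/1 scores: four persons by three items.
answers = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(coefficient_alpha(answers), kr20(answers))
```

Using divisor N − 1 for every variance instead would leave α unchanged, since the same factor cancels in the ratio, but it would then no longer match $\sum p_i q_i$ term by term.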

Since each writer offering a derivation used his own set of as- 
sumptions, and in some cases criticized those used by his predeces- 
sors, the precise meaning of the formula became obscured. The origi- 
nal derivation unquestionably made much more stringent assumptions 
than necessary, which made it seem as if the formula could properly 
be applied only to rare tests which happened to fit these conditions. 
It has generally been stated that α gives a lower bound to “the true
reliability,” whatever that means to that particular writer. In this
paper, we take formula (2) as given, and make no assumptions regarding
it. Instead, we proceed in the opposite direction, examining
the properties of α and thereby arriving at an interpretation.

We introduce the symbol α partly as a convenience. “Kuder-Richardson
Formula 20” is an awkward handle for a tool that we expect
to become increasingly prominent in the test literature. A second
reason for the symbol is that α is one of a set of six analogous
coefficients (to be designated β, γ, δ, etc.) which deal with such other
concepts as like-mindedness of persons, stability of scores, etc. Since
we are concentrating in this paper on equivalence, the first of the six
properties, description of the five analogous coefficients is reserved
for later publication.

Critical comments on the Kuder-Richardson formula have been 
primarily directed to the fact that when inequalities are used in de- 
riving a lower bound, there is no way of knowing whether a particu- 
lar coefficient is a close estimate of the desired measure of equivalence 
or a gross underestimate. The Kuder-Richardson method is an over- 
all measure of internal consistency, but a test which is not internally 
homogeneous may nonetheless have a high correlation with a care- 
fully-planned equivalent form. In fact, items within each test may 
correlate zero, and yet the two tests may correlate perfectly if there 
is item-to-item correspondence of content. 

The essential problem set in this paper is: How shall α be interpreted?
α, we find, is the average of all the possible split-half coefficients
for a given test. Juxtaposed with further analysis of the variation
of split-half coefficients from split to split, and with an examination
of the relation of α to item homogeneity, this relation leads to
recommendations for estimating coefficients of equivalence and
homogeneity.


II. A Comparison of Split-Half Formulas 

The problem set by those who have worked out formulas for split- 
half coefficients is to predict the correlation between two equivalent 
whole tests, when data on two half-tests are at hand. This requires 
them to define equivalent tests in mathematical terms. 

The first definition is that introduced by Brown (2) and by
Spearman (33), namely, that we seek to predict correlation with a
test whose halves are c and d, possessing data from a test whose
halves are a and b, and that

$$r_{ab} = r_{ac} = r_{ad} = r_{bc} = r_{bd} = r_{cd} . \tag{3}$$

This assumption or definition is far from general. For many splittings
$V_a \neq V_b$, and an equivalent form conforming to this definition is
impossible.

A more general specification of equivalence credited to Flanagan
[see (25)] is that

$$V_{(a+b)} = V_{(c+d)}; \text{ and } r_{ab}\sigma_a\sigma_b = r_{ad}\sigma_a\sigma_d = r_{bc}\sigma_b\sigma_c = r_{cd}\sigma_c\sigma_d = \cdots . \tag{4}$$

This assumption leads to various formulas which are collected in the
first column of Table 1. All formulas in Column A are mathematically
identical and interchangeable.


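TABLE 1
Formulas for Split-Half Coefficients

Entering Data*                        Formulas Assuming Equal Covariances Between Half-Tests                                                      Formulas Assuming $\sigma_a = \sigma_b$

$r_{ab}$, $\sigma_a$, $\sigma_b$      1A†  $\dfrac{4\sigma_a\sigma_b r_{ab}}{\sigma_a^2 + \sigma_b^2 + 2\sigma_a\sigma_b r_{ab}}$                  1B‡  $\dfrac{2 r_{ab}}{1 + r_{ab}}$

$\sigma_t$, $\sigma_a$, $\sigma_b$    2A§  $2\left(1 - \dfrac{\sigma_a^2 + \sigma_b^2}{\sigma_t^2}\right)$

$\sigma_a$, $\sigma_t$, $r_{at}$      3A‖  $\dfrac{4(r_{at}\sigma_a\sigma_t - \sigma_a^2)}{\sigma_t^2}$

$\sigma_t$, $\sigma_d$                4A¶  $1 - \dfrac{\sigma_d^2}{\sigma_t^2}$                                                                    4B (= 4A)  $1 - \dfrac{\sigma_d^2}{\sigma_t^2}$

$\sigma_a$, $\sigma_d$, $r_{ad}$      5A   $\dfrac{4(\sigma_a^2 - \sigma_a\sigma_d r_{ad})}{4\sigma_a^2 + \sigma_d^2 - 4\sigma_a\sigma_d r_{ad}}$  5B  $\dfrac{2(2\sigma_a^2 - \sigma_d^2)}{4\sigma_a^2 - \sigma_d^2}$

*In this table, a and b are the half-test scores, t = a + b, d = a - b.
†After Flanagan (25)
‡Spearman-Brown (2, 33)
§Guttman (19)
‖After Mosier (28)
¶Rulon (31)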


When a particular split is such that $\sigma_a = \sigma_b$, the Flanagan
requirement reduces to the original Spearman-Brown assumption, and
in that case we arrive at the formulas in Column B. Formulas 1B
and 5B are not identical, since the assumption enters the formulas in
different ways. No short formula is provided opposite 2A or 3A, since
these exact formulas are themselves quite simple to compute.

Because of the wide usage of Formula 1B, the Spearman-Brown,
it is of interest to determine how much difference it makes which
assumption is employed. If we divide 1B by any of the formulas in
Column A we obtain the ratio

$$k_1 = \frac{2mr + m^2 + 1}{2m(1 + r)} = \frac{1}{(1 + r)}\left(\frac{m^2 + 1}{2m} + r\right), \tag{5}$$

in which $m = \sigma_b/\sigma_a$, $\sigma_a < \sigma_b$, and $r$ signifies $r_{ab}$. The ratio when 5B
is divided by any of the formulas in the first column is as follows:

$$k_5 = \frac{(2mr - m^2 + 1)(1 + 2mr + m^2)}{2mr\,(2mr - m^2 + 3)} . \tag{6}$$





When m equals 1, that is, when the two standard deviations are equal, 
the formula in Column B is identical to that in Column A. As Table 
2 shows, there is increasing disagreement between Formula 1B and 
those in Column A as m departs from unity. The estimate by the 
Spearman-Brown formula is always slightly larger than the coeffi- 
cient of equivalence computed by the more tenable definition of com- 
parability. 


TABLE 2
Ratio of Spearman-Brown Estimate to More Exact Split-Half Estimate of
Coefficient of Equivalence when S.D.’s are Unequal

Ratio of
Half-Test
S.D.’s                      Correlation Between Half-Tests
(greater/lesser)    .00      .20      .40      .60      .80      1.00

1.0                 1        1        1        1        1        1
1.1                 1.005    1.004    1.003    1.003    1.003    1.002
1.2                 1.017    1.014    1.012    1.010    1.009    1.008
1.3                 1.035    1.029    1.025    1.022    1.020    1.017
1.4                 1.057    1.048    1.041    1.036    1.032    1.029
1.5                 1.083    1.069    1.060    1.052    1.046    1.042





Formula 5B is not so close an approximation to the results from
formulas in Column A. When $m$ is 1.1, for example, the values of $k_5$
are as follows: for $r$ = .20, .62; for $r$ = .60, .70; for $r$ = 1.00, .999.
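The ratios of equations (5) and (6) are easy to tabulate directly, which gives a check on Table 2. The sketch below is only an illustration; the function names are hypothetical, and the Column A coefficient is written in the form of Formula 1A.

```python
def spearman_brown(r):
    """Formula 1B: stepped-up correlation assuming equal half-test S.D.'s."""
    return 2.0 * r / (1.0 + r)


def flanagan(r, sd_a, sd_b):
    """Formula 1A: split-half coefficient without the equal-S.D. assumption."""
    return 4.0 * r * sd_a * sd_b / (sd_a**2 + sd_b**2 + 2.0 * r * sd_a * sd_b)


def k1(m, r):
    """Equation (5): ratio of Formula 1B to any Column A formula."""
    return (2.0 * m * r + m * m + 1.0) / (2.0 * m * (1.0 + r))


def k5(m, r):
    """Equation (6): ratio of Formula 5B to any Column A formula (r must not be 0)."""
    return ((2.0 * m * r - m * m + 1.0) * (1.0 + 2.0 * m * r + m * m)
            / (2.0 * m * r * (2.0 * m * r - m * m + 3.0)))


# Rebuild the body of Table 2: rows are S.D. ratios m, columns are correlations r.
for m in (1.0, 1.1, 1.2, 1.3, 1.4, 1.5):
    row = [round(k1(m, r), 3) for r in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)]
    print(m, row)
```

For r = 0 both coefficients are zero and their ratio is undefined, so the first column of the table is the limiting value given by equation (5) rather than a quotient of two computed coefficients.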

It is recommended that the interchangeable formulas 2A and 4A 
be used in obtaining split-half coefficients. These formulas involve 
no assumptions contradictory to the data. They are therefore prefer- 
able to the Spearman-Brown formula. However, if the ratio of the 
standard deviations of the half-tests is between .9 and 1.1, the Spear- 
man-Brown formula gives essentially the same result. This finding 
agrees with Kelley’s earlier analysis of much the same question (2, 3). 


III. α as the Mean of Split-Half Coefficients


To demonstrate the relation between α and the split-half formulas,
we shall need the following notation:

Let $n$ be the number of items.

The test $t$ is divided into two half-tests, $a$ and $b$. $i'$ will designate
any item of half-test $a$, and $i''$ will designate any item of half-test
$b$. Each half-test contains $n'$ items, where $n' = n/2$.

$V_t$, $V_a$, and $V_b$ are the variances of the total test and the respective
half-tests.

$C_{ij}$ is the covariance of two items $i$ and $j$.

$C_a$ is the total covariance for all items in pairs within half-test
$a$, each pair counted once; $C_b$ is the corresponding “within-test”
covariance for $b$.

$C_t$ is the total covariance of all item pairs within the test.

$C_{ab}$ is the total covariance of all item pairs such that one item is
within $a$ and the other is within $b$; it is the “between halves” covariance.


FIGURE 1
Schematic Division of the Matrix of Item Variances and Covariances.


Then

$$C_{ab} = r_{ab}\sigma_a\sigma_b ; \tag{7}$$

$$C_t = C_a + C_b + C_{ab} ; \tag{8}$$

$$V_t = V_a + V_b + 2C_{ab} = \sum_i V_i + 2C_t ; \text{ and} \tag{9}$$

$$V_a = \sum_{i'} V_{i'} + 2C_a , \text{ and } V_b = \sum_{i''} V_{i''} + 2C_b . \tag{10}$$


These identities are readily visible in the sketches of Figure 1, which 
is based on the matrix of item covariances and variances. Each point 
along the diagonal represents a variance. The sum of all entries in 
the square is the test variance. 

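Rewriting split-half formula 2A, we have

$$r_{tt} = 2\left(1 - \frac{V_a + V_b}{V_t}\right) = 2\left(\frac{V_t - V_a - V_b}{V_t}\right) ; \tag{11}$$

$$r_{tt} = \frac{4C_{ab}}{V_t} . \tag{12}$$
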




This indicates that whether a particular split gives a high or low co- 
efficient depends on whether the high interitem covariances are placed 
in the “between halves” covariance or whether the items having high 
correlations are placed instead within the same half. 


Now we rewrite α:

$$\alpha = \frac{n}{n-1}\left(1 - \frac{\sum_i V_i}{V_t}\right) = \frac{n}{n-1}\left(\frac{V_t - \sum_i V_i}{V_t}\right) ; \tag{13}$$

$$\alpha = \frac{n}{n-1}\,\frac{2C_t}{V_t} ; \tag{14}$$

$$\bar{C}_{ij} = \frac{C_t}{n(n-1)/2} . \tag{15}$$

Therefore

$$\alpha = \frac{n^2\,\bar{C}_{ij}}{V_t} . \tag{16}$$

We proceed now by determining the mean coefficient from all
$(2n')!/2(n'!)^2$ possible splits of the test. From (12),
$$\bar{r}_{tt} = \frac{4\bar{C}_{ab}}{V_t} . \tag{17}$$

In any split, a particular $C_{ij}$ has a probability of $\dfrac{n}{2(n-1)}$ of falling
into the between-halves covariance $C_{ab}$. Then over all splits,

$$\sum C_{ab} = \frac{(2n')!}{2(n'!)^2}\,\frac{n}{2(n-1)}\sum_i\sum_j C_{ij} ; \quad (i = 1, 2, \cdots n-1;\ j = i+1, \cdots, n). \tag{18}$$

But

$$\sum_i\sum_j C_{ij} = \frac{n(n-1)}{2}\,\bar{C}_{ij} . \tag{19}$$

$$\sum C_{ab} = \frac{(2n')!}{2(n'!)^2}\,\frac{n^2}{4}\,\bar{C}_{ij} , \tag{20}$$

and

$$\bar{C}_{ab} = \frac{n^2}{4}\,\bar{C}_{ij} . \tag{21}$$

From (17),

$$\bar{r}_{tt} = \frac{4}{V_t}\,\frac{n^2}{4}\,\bar{C}_{ij} = \frac{n^2\,\bar{C}_{ij}}{V_t} . \tag{22}$$

Therefore

$$\bar{r}_{tt} = \alpha . \tag{23}$$

From (14), we can also write α in the form

$$\alpha = \frac{n}{n-1}\,\frac{\sum_i\sum_j C_{ij}}{V_t} ; \quad (i, j = 1, 2, \cdots n;\ i \neq j). \tag{24}$$

This important relation states a clear meaning for α as n/(n−1) times
the ratio of interitem covariance to total variance. The multiplier
n/(n−1) allows for the proportion of variance in any item which is
due to the same elements as the covariance.

α as a special case of the split-half coefficient. Not only is α a
function of all the split-half coefficients for a test; it can also be shown
to be a special case of the split-half coefficient.

If we assume that the test is divided into equivalent halves such
that $\bar{C}_{i'i''}$ (i.e., $C_{ab}/n'^2$) equals $\bar{C}_{ij}$, the assumptions for formula 2A
still hold. We may designate the split-half coefficient for this splitting
as $r_{tt_0}$.

$$r_{tt_0} = \frac{4C_{ab}}{V_t} . \tag{12}$$

Then

$$r_{tt_0} = \frac{4n'^2\,\bar{C}_{i'i''}}{V_t} = \frac{4n'^2\,\bar{C}_{ij}}{V_t} = \frac{n^2\,\bar{C}_{ij}}{V_t} . \tag{25}$$

From (16),

$$r_{tt_0} = \alpha . \tag{26}$$

This amounts to a proof that α is an exact determination of the
parallel-form correlation when we can assume that the mean covariance
between parallel items equals the mean covariance between unpaired
items. This is the least restrictive assumption usable in “proving”
the Kuder-Richardson formula.
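The identity of equation (23) can be verified by brute force on a small test. The sketch below enumerates every division of an even number of items into two halves, computes the split-half coefficient of Formula 2A for each, and compares the mean with α from formula (2). The function names and the score matrix are hypothetical, and population variances (divisor N) are assumed throughout.

```python
from itertools import combinations
from statistics import pvariance


def coefficient_alpha(scores):
    """Formula (2), with rows as persons and columns as items."""
    n = len(scores[0])
    totals = [sum(person) for person in scores]
    item_vars = [pvariance([person[i] for person in scores]) for i in range(n)]
    return n / (n - 1) * (1.0 - sum(item_vars) / pvariance(totals))


def split_half_2a(scores, half_a):
    """Formula 2A for the split whose first half holds the item indexes in half_a."""
    n = len(scores[0])
    half_b = [i for i in range(n) if i not in half_a]
    a = [sum(person[i] for i in half_a) for person in scores]
    b = [sum(person[i] for i in half_b) for person in scores]
    t = [x + y for x, y in zip(a, b)]
    return 2.0 * (1.0 - (pvariance(a) + pvariance(b)) / pvariance(t))


def mean_split_half(scores):
    """Mean of Formula 2A over all (2n')!/2(n'!)^2 splits into equal halves."""
    n = len(scores[0])
    coefficients = []
    # Keeping item 0 in half a counts each division of the test exactly once.
    for rest in combinations(range(1, n), n // 2 - 1):
        coefficients.append(split_half_2a(scores, (0,) + rest))
    return sum(coefficients) / len(coefficients)


# Hypothetical scores: six persons by four items (three possible splits).
answers = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
print(mean_split_half(answers), coefficient_alpha(answers))
```

The two printed values agree to rounding error for any score matrix with an even number of items and nonzero total variance, although the individual split-half coefficients may differ widely from one split to another.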

α as the equivalence of random samples of items. The foregoing
demonstrations show that α measures essentially the same thing as
the split-half coefficient. If all the splits for a test were made, the
mean of the coefficients obtained would be α. When we make only
one split, and make that split at random, we obtain a value somewhere
in the distribution of which α is the mean. If split-half coefficients are
distributed more or less symmetrically, an obtained split-half coefficient
will be higher than α about as often as it is lower than α. This
average that is α is based on the very best splits and also on some
very poor splits where the items going into the two halves are quite
unlike each other.

Suppose we have a universe of items for which the mean covariance
is the same as the mean covariance within the given test. Then
suppose two tests are made by twice sampling n items at random
from this universe without replacement, and administered at the same
sitting. Their correlation would be a coefficient of equivalence. The
mean of such coefficients would be the same as the computed α. α is
therefore an estimate of the correlation expected between two tests
drawn at random from a pool of items like the items in this test. Items
are not selected at random for psychological tests where any differentiation
among the items’ contents or difficulties permits a planned
selection. Two planned samplings may be expected to have higher
correlations than two random samplings, as Kelley pointed out (25).
We shall show that this difference is usually small.


IV. An Examination of Previous Interpretations and Criticisms of α


1. Is α a conservative estimate of reliability? The findings
just presented call into question the frequently repeated statement
that α is a conservative estimate or an underestimate or a lower
bound to “the reliability coefficient.” The source of this conception
is the original derivation, where Kuder and Richardson set up a
definition of two equivalent tests, expressed their correlation
algebraically, and proceeded to show by inequalities that α was lower than
this correlation. Kuder and Richardson assumed that corresponding 
items in test and parallel test have the same common content and the 
same specific content, i.e., that they are as alike as two trials of the 
same item would be. In other words, they took the zero-interval re- 
test correlation as their standard. Guttman also began his derivation 
by defining equivalent tests as identical. Coombs (6) offers the some- 
what more satisfactory name “coefficient of precision” for this index 
which reports the absolute minimum error to be found if the same 
instrument is applied twice independently to the same subject. A co- 
efficient of stability can be obtained by making the two observations 
with any desired interval between. A rigorous definition of the co- 
efficient of precision, then, is that it is the limit of the coefficient of 
stability, as the time between testings becomes infinitesimal. 

Obviously, any coefficient of equivalence is less than the coeffi- 
cient of precision, for one is based on a comparison of different items, 
the other on two trials of the same items. To put it another way: α
or any other coefficient of equivalence treats the specific content of an 
item as error, but the coefficient of precision treats it as part of the 
thing being measured. It is very doubtful if testers have any practi- 
cal need for a coefficient of precision. There is no practical testing 
problem where the items in the test and only these items constitute the 
trait under examination. We may be unable to compose more items 
because of our limited skill as testmakers but any group of items in a 
test of intelligence or knowledge or emotionality is regarded as a sam- 
ple of items. If there weren’t “plenty more where these came from,” 
performance on the test would not represent performance on any more 
significant variable. 

We therefore turn to the question, does α underestimate appro- 
priate coefficients of equivalence? Following Kelley’s argument, the 
way to make equivalent tests is to make them as similar as possible, 
similar in distribution of item difficulty and in item content. A pair 
of tests so designed that corresponding items measure the same fac- 
tors, even if each one also contains some specific variance, will have a 
higher correlation than a pair of tests drawn at random from the 
pool of items. A planned split, where items in opposite halves are as 
similar as the test permits, may logically be expected to have a higher 
between-halves covariance than within-halves covariance, and in that 
case, the obtained coefficient would be larger than α. α is the same 
type of coefficient as the split-half coefficient, and while it may be low- 
er, it may also be higher than the value obtained by actually splitting 
a particular test at random. Both the random or odd-even split-half 
coefficient and α will theoretically be lower than the coefficient from 
parallel forms or parallel splits. 

2. Is α less than the coefficient of stability? Some writers ex- 
pect α to be lower than the coefficient of stability. Thus Guttman says 
(34, p. 311): 

For the case of scale scores, then, . . . we have the assurance that if 
the items are approximately scalable [in which case α will be high], then 
they necessarily have very substantial test-retest reliability. 


Guilford says (16, p. 485): 


There can be very low internal consistency and yet substantial or 
high retest reliability. It is probably not true, however, that there can be 
high internal consistency and at the same time low retest reliability, ex- 
cept after very long time intervals. If the two indices of reliability dis- 
agree for a test, we can place some confidence in the inference that the 
test is heterogeneous. 


The comment by Guttman is based on sound thinking, provided 
we reinterpret test-retest coefficient on the basis of the context of the 
comment to refer to the instantaneous retest (i.e., coefficient of pre- 
cision) rather than the retest after elapsed time. Guilford’s statement 
is acceptable only if viewed as a summary of his experience. There 
is no mathematical necessity for his remarks to be true. In the co- 
efficient of stability, variance in total score between trials (within per- 
sons) is regarded as a source of error, and variance in specific fac- 
tors (between items within persons) within trials is regarded as true 
variance. In the coefficient of equivalence, such as α, this is just re- 
versed: variance in specific factors is treated as error. Variation 
between trials is non-existent and does not reduce true variance (9). 
Whether the coefficient of stability is higher or lower than the co- 
efficient of equivalence depends on the relative magnitude of these 
variances, both of which are likely to be small for long tests of stable 
variables. Tests are also used for unstable variables such as mood, 
morale, social interaction, and daily work output, and studies of this 
sort are becoming increasingly prominent. Suppose one builds a 
homogeneous scale to obtain students’ evaluations of each day’s class- 
work, the students marking the checklist at the end of each class hour. 
Homogeneous items could be found for this. Yet the scale would have 
marked instability from day to day, if class activities varied or the 
topics discussed had different interest value for different students. 

The only proper conclusion is that α may be either higher or low- 
er than the coefficient of stability over an interval of time. 

3. Are coefficients from parallel splits appreciably higher than 
random-split coefficients or α? The logical presumption is strong 
that planned splits as proposed by Kelley (25) and Cronbach (7) 
would yield coefficients nearer to the equivalent-tests coefficient than 
random splits do. There is still the empirical question whether this 
advantage is large enough to be considered seriously. This raises 
two questions: Is there appreciable variation in coefficients from split 
to split? If so, does the judgment made in splitting the test into a 
priori equivalent halves raise the coefficient? Brownell (3), Cronbach 
(8), and Clark (5) have compared coefficients obtained by splitting 
a test in many ways. There is doubt that the variation among co- 
efficients is ordinarily a serious matter; Clark in particular found that 
variation from split to split was small compared to variation arising 
from sampling of subjects. 

Empirical evidence. To obtain further data on this question, two 
analyses were made. One employs responses of 250 ninth-grade boys 
who took Mechanical Reasoning Test Form A of the Differential 
Abilities Tests. The second study uses a ten-item morale scale, adapted 
from the Rundquist-Sletto General Morale Scale by Donald M. Sharpe 
and administered by him to teachers and school administrators.* 

The Mechanical Reasoning Test seems to contain items requir- 
ing specific knowledges regarding pulleys, gears, etc. Other items 
seem to be answerable on the basis of general experience or reason- 
ing. The items seemed to represent sufficiently heterogeneous content 
that grouping into parallel splits would be possible. We found, how- 
ever, that items grouped on a priori grounds had no higher correla- 
tions than items believed to be unlike in content. This finding is con- 
firmed by Air Force psychologists who made a similar attempt to cate- 
gorize items from a mechanical reasoning test and found that they 
could not. These items, they note, “are typically complex factorially” 
(15, p. 309). 

*Thanks are expressed to Dr. A. G. Wesman and the Psychological Corpora- 
tion, and to Dr. Sharpe, for making available the data for the two studies, re- 
spectively. 

Eight items which some students omitted were dropped. An item 
analysis was made for 50 papers. Using this information, ten paral- 
lel splits were made such that items in opposite halves had comparable 
difficulty. These we call Type I splits. Then eight more splits were 
made, placing items in opposite halves on the basis of both difficulty 
and apparent content (Type II splits). Fifteen random splits were 
made. For all splits, Formula 2A was applied, using the 200 remain- 
ing cases. Results appear in Table 3. 


TABLE 3 

Summary of Data from Repeated Splittings of Mechanical Reasoning Test 
(60 items; α = .811) 

                     All Splits                          Splits Where 1.05 > σ_a/σ_b > .95 
Type of Split        No. of                              No. of 
                     Coefficients   Range       Mean     Coefficients   Range       Mean 
Random                   15         .779-.860   .810          8         .795-.860   .817 
Parallel Type I          10         .798-.846   .820          6         .798-.846   .822 
Parallel Type II          8         .801-.833   .817          4         .809-.826   .818 


There are only 126 possible splits for the morale test, and it is 
possible to compute all half-test standard deviations directly from the 
item variances and covariances. Of the 126 splits, six were designated 
in advance as Type II parallel splits, on the basis of content and an 
item analysis of a supplementary sample of papers. Results based on 
200 cases appear in Table 4. 
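
The exhaustive splitting described above can be sketched in a few lines. This is an illustration only, not the computation behind Table 4: the covariance matrix is invented, and Formula 2A is assumed here to be the split-half form 4C_ab/V_t (four times the between-halves covariance over the total variance). Under that assumption the mean of all 126 coefficients for a ten-item test coincides with α.

```python
from itertools import combinations

def total_variance(cov):
    """Variance of the total score: sum of every entry of the covariance matrix."""
    return sum(sum(row) for row in cov)

def alpha(cov):
    """Coefficient alpha from an item covariance matrix."""
    n = len(cov)
    item_variances = sum(cov[i][i] for i in range(n))
    return n / (n - 1) * (1 - item_variances / total_variance(cov))

def split_half_coefficients(cov):
    """One coefficient 4*C_ab/V_t per distinct division into equal halves."""
    n = len(cov)
    v_t = total_variance(cov)
    coefficients = []
    # Item 0 is always placed in half a, so each split is counted once.
    for rest in combinations(range(1, n), n // 2 - 1):
        half_a = (0,) + rest
        half_b = [j for j in range(n) if j not in half_a]
        c_ab = sum(cov[i][j] for i in half_a for j in half_b)
        coefficients.append(4 * c_ab / v_t)
    return coefficients

def example_covariances(n=10):
    """Hypothetical matrix: unit variances, unequal positive covariances."""
    return [[1.0 if i == j else 0.10 + 0.02 * ((i * j) % 5) for j in range(n)]
            for i in range(n)]

cov = example_covariances()
coefficients = split_half_coefficients(cov)
mean_coefficient = sum(coefficients) / len(coefficients)
print(len(coefficients), min(coefficients), max(coefficients))
print(mean_coefficient, alpha(cov))
```

Individual splits scatter around the mean, which is the kind of range the tables summarize; only the mean is tied to α.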


TABLE 4 

Summary of Data from Repeated Splittings of Morale Scale 
(10 items; α = .715) 

                     All Splits                                  Splits Where 1.1 > σ_a/σ_b > .9 
Type of Split        No. of                                      No. of 
                     Coefficients   Range       Mean             Coefficients   Range              Mean 
All Splits              126         .609-.797   .715                  82        .609-.797          [illegible] 
Parallel (Type II)        6         .681-.780   [illegible: 7387]      5        [illegible]-.780   .748 

The highest and lowest coefficients for the mechanical test differ 
by only .08, a difference which would be important only when a very 
precise estimate of reliability is needed. The range for the morale 
scale is greater (.20), but the probability of obtaining one of the ex- 
treme values in sampling is slight. Our findings agree with Clark, that 
the variation from split to split is less than the variation expected 
from sample to sample for the same split. The standard error of a 
Spearman-Brown coefficient based on 200 cases using the same split is 
.03 when r_tt = .8, .04 when r_tt = .7. The former value compares with 
a standard deviation of .02 for all random-split coefficients of the me- 
chanical test. The standard error of .04 compares with a standard 
deviation of .035 for the 126 coefficients of the morale test. 

This bears on Kelley’s comment on proposals to obtain a unique 
estimate: “A determinate answer would result if the mean for all 
possible splits were gotten, but, even neglecting the labor involved, 
this would seem to contravene the judgment of comparability.” (25, 
p. 79). As our tables show, the splittings where half-test standard 
deviations are unequal, which “contravene the judgment of compar- 
ability,” have coefficients about like those which have equal standard 
deviations. 

Combining our findings with those of Clark and Cronbach we 
have studies of seven tests which seem to show that the variation from 
split to split is too small to be of practical importance. Brownell finds 
appreciable variation, however, for the four tests he studied. The ap- 
parent contradiction is explained by the fact that the former results 
applied to tests having fairly large coefficients of equivalence (.70 or 
over). Brownell worked with tests whose coefficients were much low- 
er, and the larger range of r’s does not represent any greater varia- 
tion in z values at this lower level. 

In Tables 3 and 4, the values obtained from deliberately equated 
half-tests differ slightly, but only slightly, from those for random 
splits. Where α is .715 for the morale scale, the mean of parallel splits 
is .748, a difference of no practical importance. One parallel split 
reaches .780, but this split could not have been defended a priori as 
more logical than the other planned splits. In Table 3, we find that 
neither Type I nor Type II splits averaged more than .01 higher than 
α. Here, then, is evidence that the sort of judgment a tester might 
make on typical items, knowing their content and difficulty, does not, 
contrary to the earlier opinion of Kelley and Cronbach, permit him to 
make more comparable half-tests than would be obtained by random 
splitting. The data from Cronbach’s earlier study agree with this. This 
conclusion seems to apply to tests of any length (the morale scale has 
only ten items). Where items fall into obviously diverse subgroups in 
either content or difficulty, as, say, in the California Test of Mental 
Maturity, the tester’s judgment could provide a better-than-random 
split. It is dubious whether he could improve on a random division 
within subtests. 

It should be noted that in this empirical study no attempt was 
made to divide items on the basis of r_it, as Gulliksen (18, p. 207-210) 
has recently suggested. Provided this is done on a large sample of 
cases other than those used to estimate r_tt, Gulliksen’s plan might 
indeed give parallel-split coefficients which are consistently at least a 
few points higher than α. 

The failure of the data to support our expectation led to a further 
study of the problem. We discovered that even tests which seem to 
be heterogeneous are often highly saturated with the first factor 
among the items. This forces us not only to extend the interpretation 
of α, but also to reexamine certain theories of test design. 

Factorial composition of the test variance. To make fully clear 
the relations involved, our analytic procedure will be spelled out in 
detail. We postulate that the variance of any item can be divided 
among k + 1 orthogonal factors (k common with other items and one 
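unique). Of these, we shall refer to the first, f_1, as the general factor, 
even though it is possible that some items would have a zero load- 
ing on this factor.* Then if f_{zi} is the loading of common factor z on 
item i, 

\[ 1.00 = N^2 \left( f_{1i}^2 + f_{2i}^2 + f_{3i}^2 + \cdots + f_{U_i}^2 \right). \tag{27} \]

\[ C_{ij} = N^2 \sigma_i \sigma_j \left( f_{1i} f_{1j} + f_{2i} f_{2j} + \cdots + f_{ki} f_{kj} \right). \tag{28} \]

\[ C_t = \sum_i \sum_j C_{ij} = N^2 \sum_i \sum_j \sigma_i \sigma_j f_{1i} f_{1j} + \cdots + N^2 \sum_i \sum_j \sigma_i \sigma_j f_{ki} f_{kj} \]
\[ (i = 1, 2, \ldots, n-1;\ j = i+1, \ldots, n). \tag{29} \]

\[ V_t = N^2 \sum_i \sigma_i^2 \left( f_{1i}^2 + \cdots + f_{ki}^2 + f_{U_i}^2 \right) + 2N^2 \sum_i \sum_{j>i} \sigma_i \sigma_j f_{1i} f_{1j} + \cdots + 2N^2 \sum_i \sum_{j>i} \sigma_i \sigma_j f_{ki} f_{kj}. \tag{30} \]

If n_1 items contain non-zero loadings on factor 1, and n_2 items 
contain factor 2, etc., then V_t consists of 

\[
\begin{aligned}
& n_1^2 \text{ terms of the form } N^2 \sigma_i \sigma_j f_{1i} f_{1j}, \text{ plus} \\
& n_2^2 \text{ terms of the form } N^2 \sigma_i \sigma_j f_{2i} f_{2j}, \text{ plus} \\
& n_3^2 \text{ terms of the form } N^2 \sigma_i \sigma_j f_{3i} f_{3j}, \text{ plus and so on to} \\
& n_k^2 \text{ terms of the form } N^2 \sigma_i \sigma_j f_{ki} f_{kj}, \text{ plus} \\
& n \text{ terms of the form } N^2 \sigma_i^2 f_{U_i}^2.
\end{aligned} \tag{31}
\]

*This factor may be a so-called primary or reference factor like Verbal, but 
it is more likely to be a composite of several such elements which contribute to 
every item. 
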


We rarely know the values of the factor loadings for an actual 
test, but we can substitute values representing different kinds of test 
structure in (30) and observe the proportionate influence of each 
factor in the total test. 

First we shall examine a test made up of a general factor and five 
group factors, in effect a test which might be arranged into five cor- 
related subtests. k = 6. Let n_1 = n, so f_1 is truly general, and let 
n_2 = n_3 = n_4 = n_5 = n_6 = n/5. To keep the illustration simple, we 
shall assume that all items have equal variances and that any factor 
has the same loading (f_z) in all items where it appears. Then 

\[ \frac{V_t}{N^2 \sigma_i^2} = n^2 f_1^2 + \frac{n^2}{25} f_2^2 + \frac{n^2}{25} f_3^2 + \cdots + \frac{n^2}{25} f_6^2 + \sum_i f_{U_i}^2. \tag{32} \]

It follows that in this particular example, there are n² general factor 
terms, n²/5 group factor terms, and only n unique factor terms. There 
are, in all, 6n²/5 + n terms in the variance. Let f²_{zt} be the proportion 
of test variance due to each factor. Then if we assume that all the 
terms making up the variance are of the same approximate magni- 
tude, 

\[ f_{1t}^2 = \frac{5n^2}{6n^2 + 5n} = \frac{5n}{6n + 5}. \tag{33} \]

\[ \lim_{n \to \infty} f_{1t}^2 = \frac{5}{6} = .83. \tag{34} \]

\[ f_{2t}^2 = \cdots = f_{6t}^2 = \frac{n^2/5}{6n^2 + 5n}. \tag{35} \]

\[ \lim_{n \to \infty} f_{6t}^2 = .03. \tag{36} \]

\[ \sum_i f_{U_i t}^2 = \frac{5}{6n + 5}. \tag{37} \]

\[ \lim_{n \to \infty} \sum_i f_{U_i t}^2 = 0. \tag{38} \]

Note that among the terms making up the variance of any test, the 
number of terms representing the general factor is n times the num- 
ber representing item specific and error factors. 
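
The term counting in this example is easy to reproduce. The sketch below only counts terms, under the same simplification as the text (every term of about the same size); the function name `term_proportions` is made up for the illustration, and n is assumed to be divisible by five.

```python
from fractions import Fraction

def term_proportions(n):
    """Share of variance terms by factor type for n items (n divisible by 5)."""
    general = n * n              # every pair of items shares the general factor
    per_group = (n // 5) ** 2    # each group factor appears in n/5 of the items
    unique = n                   # one unique-factor term per item
    total = general + 5 * per_group + unique
    return {
        "general": Fraction(general, total),
        "each_group": Fraction(per_group, total),
        "unique": Fraction(unique, total),
    }

for n in (5, 20, 60, 1000):
    shares = term_proportions(n)
    print(n, {name: round(float(value), 3) for name, value in shares.items()})
```

As n grows the general-factor share approaches 5/6 and the unique share approaches zero, matching the limits stated in the text.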

We have seen that the general factor cumulates a very large in- 
fluence in the test. This is made even clearer by Figure 2, where we 
plot the trend in variance for a particular case of the above test struc- 
ture. Here we set k = 6, n_1 = n, n_2 = n_3 = n_4 = n_5 = n_6 = n/5. Then 
we assume that each item has the composition: 9% general factor, 9% 
from some one group factor, 82% unique. Further, the unique vari- 
ance is divided 70/12 between error and specific stable variance. It 
is seen that even with unreliable items such as these, which intercor- 
relate only .09 or .18, the general factor quickly becomes the predomi- 
nant portion of the variance. In the limit, as n becomes indefinitely 
large, the general factor is 5/6 of the variance, and each group factor 
is 1/30 of the total variance. 

FIGURE 2 

Change in Proportion of Test Variance due to General, Group, and Unique 
Factors among the Items as n Increases. 
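
The trend described for this item composition can be recomputed directly. The following is a sketch under the stated assumptions (unit item variances, 9% general, 9% from one of five group factors, 82% unique); it is not the computation used to draw the figure, and the function name is hypothetical.

```python
def variance_shares(n, general=0.09, group=0.09, n_groups=5):
    """Shares of total-score variance for n unit-variance items.

    Returns (general share, all group factors together, unique share).
    n is assumed to be divisible by n_groups.
    """
    unique = 1.0 - general - group
    items_per_group = n // n_groups
    general_part = general * n * n                      # n^2 general-factor terms
    group_part = group * n_groups * items_per_group ** 2
    unique_part = unique * n                            # one unique term per item
    total = general_part + group_part + unique_part
    return general_part / total, group_part / total, unique_part / total

for n in (5, 10, 25, 50, 100, 500):
    g, grp, u = variance_shares(n)
    print(f"n={n:4d}  general={g:.3f}  groups={grp:.3f}  unique={u:.3f}")
```

With five items the unique factors still carry most of the variance; the general factor overtakes them as items are added and tends toward 5/6.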

This relation has such important consequences that we work out 
two more illustrative substitutions in Table 5. We first consider the 
test which is very heterogeneous in one sense, in that each group of 
five items introduces a different group factor. No factor save factor 
1 is found in more than 5 items. Here great weight in each item is 
given to the group factor, yet even so, the general factor quickly 
cumulates in the covariance terms and outweighs the group factors. 

The other illustration involves a case where the general factor 
is much less important in the items than two group factors, each pres- 
ent in half the items. In this type of test, the general factor takes on 
some weight through cumulation, but the group factors do not fade 
into insignificance as before. We can generalize that when the pro- 
portion of items containing each common factor remains constant as 
a test is lengthened (factor loadings being constant also), the ratio 
of the variances contributed by any two common factors remains con- 
stant. That is, in such a test pattern each item accounts for a nearly 
constant fraction of the non-unique variance. 

While our description has discussed number of terms, and has 
simplified by holding constant both item variances and factor load- 
ings, the same general trends hold if these conditions are not imposed. 
The mathematical notation required is intricate, and we have not 
attempted a formal derivation of these general principles: 

If the magnitude of item intercorrelations is the same, on the 
average, in successive groups of items as a test is lengthened, 

(a) Specific factors and unreliability of responses on single items 
    account for a rapidly decreasing proportion of the variance 
    if the added items represent the same factors as the original 
    items. Roughly, the contribution is inversely proportional 
    to test length. 

(b) The ratio in which the remaining variance is divided among 
    the general factor and group factors 
    (i) is constant if these factors are represented in the added 
        items to the same extent as in the original items;* 
    (ii) increases, if the group factors present in the original 
        items have less weight in the added items. 


As a test is lengthened, the general factor accounts for a larger 
and larger proportion of the total variance. In the case where only a 
few group factors are present no matter how many items are added, 
these also account for an increasing and perhaps substantial portion 
of the variance. But when each factor other than the first is present 
in only a few items, the general factor accounts for the lion’s share 
of the variance as the test reaches normal length. We shall return to 
the implications of this for test design and for homogeneity theory. 

*This is the case discussed in the recent paper of Guilford and Michael (17). 
Our conclusion is identical to theirs. 

Next, however, we apply this to coefficients of equivalence. We 
may study the composition of half-tests just as we have studied the 
total test. And we may also examine the composition of C_ab, the be- 
tween-halves covariance. In Table 6, we consider first the test where 
there is a general factor and two group factors. If the test is divided 
into halves such that every item is factorially identical to its opposite 
number, save for the unique factor in each, the covariance C_ab none- 
theless depends primarily upon the general-factor terms. Note, for 
example, the twenty-item test. Two-thirds of the covariance terms 
are the result of item similarity in the general factor. Suppose that 
these general factor terms are about equal in size. Then, should the 
test be split differently, the covariance would be reduced to the extent 
that more than half the items loaded with (say) factor 2 fall in the 
same half, but even the most drastic possible departure from the par- 
allel split would reduce the covariance by only one-third of its terms. 
In the event that the group-factor loadings in the items are larger 
than the general-factor loadings, the size of the covariance is reduced 
by more than one-third. It is in this case that the parallel split has 
special advantage: where a few group factors are present and have 
loadings in the items larger than the general factor does. 

The nature of the split has even less importance for the pattern 
where each factor is found in but a few items. Suppose, for exam- 
ple, that we are dealing with the 60-item test containing 15 factors 
in four items each. Then suppose that it is so very “badly” split that 
items containing 5 of the factors were assigned only to one of the half- 
tests, and items containing the second 5 factors were assigned to the 
other half-test. This would knock out 40 terms from the between- 
halves covariance, but such a shift would reduce the covariance only 
by 40/960 of its terms. Only in the exceptional conditions where gen- 
eral factor loadings are minuscule or where they vary substantially 
would different splits of such a test produce marked differences in the 
covariance. 
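
The 40/960 figure can be verified by counting. In the sketch below each item is represented only by the set of factors it loads on (the general factor "g" plus one of 15 group factors); loadings are ignored, so this counts terms rather than computing a covariance. The particular assignment of items to halves is an assumption made for the illustration.

```python
def shared_factor_terms(half_a, half_b):
    """Count between-halves covariance terms: one per item pair per shared factor."""
    return sum(len(item_a & item_b) for item_a in half_a for item_b in half_b)

def make_items(n_group_factors=15, items_per_factor=4):
    """Each item is the set of factors it loads on: 'g' plus one group factor."""
    return [frozenset({"g", factor})
            for factor in range(n_group_factors)
            for _ in range(items_per_factor)]

def group_factor(item):
    """Return the item's single group factor (an integer label)."""
    return next(f for f in item if f != "g")

items = make_items()

# Parallel split: two items from every group factor go to each half.
parallel_a = [item for k, item in enumerate(items) if k % 4 < 2]
parallel_b = [item for k, item in enumerate(items) if k % 4 >= 2]

# "Bad" split: factors 0-4 entirely in half a, factors 5-9 entirely in half b,
# the remaining five factors still divided evenly.
bad_a, bad_b = [], []
for k, item in enumerate(items):
    factor = group_factor(item)
    if factor < 5:
        bad_a.append(item)
    elif factor < 10:
        bad_b.append(item)
    else:
        (bad_a if k % 4 < 2 else bad_b).append(item)

parallel_terms = shared_factor_terms(parallel_a, parallel_b)
bad_terms = shared_factor_terms(bad_a, bad_b)
print(parallel_terms, bad_terms, parallel_terms - bad_terms)
```

Of the 960 terms in the parallel split, 900 come from the general factor alone, which is why rearranging the group factors removes so little.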

It follows from this analysis that marked variation in the coef- 
ficients obtained when a test is split in several ways can result only 
when 

(a) a few group factors have substantial loadings in a large 
fraction of the items or 

(b) when first-factor loadings in the items tend to be very small 
or where they vary considerably. 

Even these conditions are likely to produce substantial variations only 
when the variance of a test is contributed to by only a few items. 


In the experimental tests studied by Clark, by Cronbach, and in 
the present study, general-factor loadings were probably greater, on 
the whole, than group-factor loadings. Moreover, none of the tests 
seems to have been divisible into large blocks of items each represent- 
ing one group factor. (Such large “lumps” of group factor content 
are most often found in tests broken into subtests, viz., the Number 
Series, Analogies, and other portions of the ACE Psychological ex- 
amination.) 

This establishes on theoretical grounds the fact that for certain 
common types of test, there is likely to be negligible variation among 
split-half coefficients. Therefore α, the mean coefficient, represents 
such tests as well as any parallel split. 

This interpretation differs from the Wherry-Gaylord conclusion 
(38) that “the Kuder-Richardson formula tends to underestimate the 
true reliability by the ratio (n − K)/(n − 1) when the number of 
factors, K , is greater than one.” They arrive at this by highly restric- 
tive assumptions: that all factors are present in an equal number of 
items, that no item contains more than one factor, that there is no 
general factor, and that all items measuring a factor have equal vari- 
ances and covariances. This type of test would never be intended to 
yield a psychologically interpretable score. For psychological tests 
where the intention is that all items include the same factor, our de- 
velopment shows that the quoted statement does not apply. 

The problem of differential weighting has been studied repeated- 
ly, the clearest mathematical analyses being those of Richardson (30) 
and Burt (4). This problem is closely related to our own study of test 
composition. Making different splits of a test is essentially the same 
as weighting the component items differently. The conditions under 
which split-half coefficients differ considerably are identical to those 
where differential weighting of components alters a total score appre- 
ciably: few components, lack of general factor or variation in its load- 
ings, large concentrations of variance in group factors. The more for- 
mal mathematical studies of weighting lead to the same conclusions 
as our study of special cases of test construction. 


4. How is α related to the homogeneity, internal consistency, or 
saturation of a test?* During the last ten years, various writers (12, 
19, 27) directed attention to a property they refer to as homogeneity, 
scalability, internal consistency, or the like. The concept has not been 
sharply defined, save in the formulas used to evaluate it. The gen- 
eral notion is clear: In a homogeneous test, the items measure the 
same things. 

If a test has substantial internal consistency, it is psychologically 
interpretable. Two tests, composed of different items of this type, 
will ordinarily give essentially the same report. If, on the other hand, 
a test is composed of groups of items, each measuring a different fac- 
tor, it is uncertain which factor to invoke to explain the meaning of 
a single score. For a test to be interpretable, however, it is not es- 
sential that all items be factorially similar. What is required is that 
a large proportion of the test variance be attributable to the principal 
factor running through the test (37). 

α estimates the proportion of the test variance due to all common 
factors among the items. That is, it reports how much the test score 
depends upon general and group, rather than item specific, factors. If 
we assume that the mean variance in each item attributable to com- 
mon factors (Σ_z σ_i² f_zi²) equals the mean interitem covariance 
(Σ_z σ_i σ_j f_zi f_zj), 

\[ \frac{1}{n} \sum_i \sum_z \sigma_i^2 f_{zi}^2 = \frac{2}{n(n-1)} \sum_i \sum_{j>i} C_{ij} = \frac{2}{n(n-1)} C_t. \tag{39} \]

\[ \sum_z \sum_i \sigma_i^2 f_{zi}^2 = \frac{2}{n-1} C_t, \tag{40} \]

and the total variance (item variance plus covariance) due to com- 
mon factors is 2nC_t/(n − 1). Therefore, from (14), α is the proportion 
of test variance due to common factors. Our assumption does not hold 
true when the interitem correlation matrix has rank higher than one. 
Normally, therefore, α underestimates the common-factor variance, 
but not seriously unless the test contains distinct clusters. 
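
A small numerical check may make the argument concrete. The sketch assumes unit-variance items and orthogonal common factors, builds the item covariances from a table of loadings, and compares α (taken here in the form n/(n − 1) times 2C_t/V_t, which the derivation above implies) with the true common-factor share of the total variance. Both loading patterns are invented for the illustration.

```python
def covariances(loadings):
    """Unit-variance items; loadings[i][z] is the loading of item i on common factor z."""
    n = len(loadings)
    return [[1.0 if i == j else sum(a * b for a, b in zip(loadings[i], loadings[j]))
             for j in range(n)] for i in range(n)]

def coefficient_alpha(cov):
    """alpha = n/(n-1) * 2*C_t / V_t, with C_t the sum of covariances for i < j."""
    n = len(cov)
    v_t = sum(sum(row) for row in cov)
    c_t = sum(cov[i][j] for i in range(n) for j in range(i + 1, n))
    return n / (n - 1) * 2 * c_t / v_t

def common_factor_share(loadings):
    """True proportion of total-score variance due to all common factors."""
    cov = covariances(loadings)
    v_t = sum(sum(row) for row in cov)
    n_factors = len(loadings[0])
    common = sum(sum(row[z] for row in loadings) ** 2 for z in range(n_factors))
    return common / v_t

# Eight items sharing one factor with equal loadings: the assumption holds.
homogeneous = [[0.5]] * 8
# Ten items: a weak general factor plus two strong clusters of five items.
clustered = [[0.3, 0.6, 0.0]] * 5 + [[0.3, 0.0, 0.6]] * 5

for name, pattern in (("homogeneous", homogeneous), ("clustered", clustered)):
    print(name, coefficient_alpha(covariances(pattern)), common_factor_share(pattern))
```

In the homogeneous pattern the two values agree; in the clustered pattern α falls below the common-factor share, as the text leads one to expect.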
The proportion of the test variance due to the first factor among 
the items is the essential determiner of the interpretability of the 
scores. α is an upper bound for this. For those test patterns described 
in the last section, where the first factor accounts for the preponder- 
ance of the common-factor variance, α is a close estimate of first-fac- 
tor concentration. 

*Several of the comments made in the following sections, particularly re- 
garding Loevinger’s concepts, were developed during the 1949 APA meetings in 
a paper by Humphreys (21) and in a symposium on homogeneity and reliability. 
The thinking has been aided by subsequent discussions with Dr. Loevinger. 

α applied to batteries of tests or subtests. Instead of regarding 
α as an index of item consistency, we may apply it to questions of sub- 
test consistency. If each subtest is regarded as an “item” composing 
the test, formula (2) becomes 

\[ \alpha = \frac{n}{n-1} \left( 1 - \frac{\sum V_{\text{subtests}}}{V_{\text{test}}} \right). \tag{41} \]

Here n is the number of subtests. If this formula is applied to a test 
or battery composed of separate subtests, it yields useful information 
about the interpretability of the composite. Under the assumption 
that the variance due to common factors within each subtest is on 
the average equal to the mean covariance between subtests, α indi- 
cates what proportion of the variance of the composite is due to com- 
mon factors among the subtests. In many instruments the subtests 
are positively correlated and intended to measure a general factor. 
If the matrix of intercorrelations is approximately hierarchical, so 
that group factors among subtests are small in influence, α is a meas- 
ure of first-factor concentration in the composite. 

Sometimes the variance of the test is not immediately known, 
but correlations between subtests are known. In this case one can 
compute covariances (C_ab = σ_a σ_b r_ab), or the variance of the com- 
posite (V_t is the sum of subtest variances and covariances), and ap- 
ply formula (41). But if subtest variances are not at hand, an in- 
ference can be made directly from correlations. If all subtests are 
assigned weights such that their variances are equal, i.e., they make 
equal contributions to the total, 

\[ \alpha = \frac{n}{n-1} \left( \frac{2 \sum_i \sum_j r_{ij}}{n + 2 \sum_i \sum_j r_{ij}} \right) \quad (i = 1, 2, \ldots, n-1;\ j = i+1, \ldots, n). \tag{42} \]

Here i and j are subtests, of which there are n. This formula tells 
what part of the total variance is due to the first factor among the 
subtests, when the weighted subtest variances are equal. 

A few applications will suggest the usefulness of this analysis. 
The California Test of Mental Maturity, Primary, has two part scores, 
Language and Non-Language. For a group of 725, according to the 
test authors, these scores correlate .668. Then, by (42), α, the com- 
mon-factor concentration, is .80. Turning to the Primary Mental 
Abilities Tests, we have a set of moderate positive correlations re- 
ported when these were given to a group of eighth-graders (35). The 
question may be asked: How much would a composite score on these 
tests reflect common elements rather than a hodgepodge of elements 
each specific to one subtest? The intercorrelations suggest that there 
is one general factor among the tests. Computing α on the assump- 
tion of equal subtest variances, we get .77. The total score is loaded 
to this extent with a general intellective factor. Our third illustra- 
tion relates to four Air Force scores related to carefulness. Each 
score is the count of number wrong on a plotting test. The four scores 
have rather small intercorrelations (15, p. 687), and each score has 
such low reliability that its use alone as a measure of carefulness is 
not advisable. The question therefore arises whether the tests are 
enough intercorrelated that the general factor would cumulate in a 
preponderant way in their total. The sum of the six intercorrela- 
tions is 1.76. Therefore α is .62. I.e., 62% of the variance in the 
equally weighted composite is due to the common factor among the 
tests. 
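
Two of these applications can be recomputed from the figures quoted in the text alone. The sketch assumes the equal-variance formula takes the form α = [n/(n − 1)] · 2Σr / (n + 2Σr), where Σr is the sum of the correlations over distinct pairs of subtests; the function name is hypothetical.

```python
def alpha_from_correlation_sum(sum_r, n_subtests):
    """Alpha for an equally weighted composite.

    sum_r is the sum of the intercorrelations, each pair of subtests counted once.
    """
    twice_sum = 2 * sum_r
    return n_subtests / (n_subtests - 1) * twice_sum / (n_subtests + twice_sum)

# Two part scores correlating .668 (California Test of Mental Maturity, Primary).
print(round(alpha_from_correlation_sum(0.668, 2), 2))
# Four Air Force carefulness scores whose six intercorrelations sum to 1.76.
print(round(alpha_from_correlation_sum(1.76, 4), 2))
```

For two subtests the expression reduces to 2r/(1 + r), the familiar Spearman-Brown step-up for a doubled test.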

From this approach comes a suggestion for obtaining a superior 
coefficient of equivalence for the “lumpy” test. It was shown that a 
test containing distinct clusters of items might have a parallel-split 
coefficient appreciably higher than α. If so, we should divide the test 
into subtests, each containing what appears to be a homogeneous 
group of items. α is computed for each subtest separately by (2). 
Then σ_i²α_i gives the covariance of each cluster with the opposite clus- 
ter in a parallel form, and the covariance between subtests is an esti- 
mate of the covariance of similar pairs “between forms.” Hence 

\[ \alpha = \frac{\sum_i \sum_j \sigma_i \sigma_j r_{ij}}{V_t}, \tag{43} \]

where α_i is entered for r_ii, i and j being subtests. To the extent that 
α_i is higher than the mean correlation between subtests, the parallel- 
forms coefficient will be higher than α_t computed from (2). 
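
The proposal for the “lumpy” test can be sketched as follows. The sketch assumes the coefficient is the double sum of σ_i σ_j r_ij over all subtests i and j, divided by the composite variance, with each subtest's own α standing in for r_ii on the diagonal. The numbers in the example are invented.

```python
def lumpy_test_coefficient(sds, correlations, subtest_alphas):
    """Parallel-forms estimate for a test divided into homogeneous subtests.

    correlations is a full symmetric matrix with 1.0 on the diagonal;
    subtest_alphas[i] replaces the diagonal entry r_ii in the numerator.
    """
    n = len(sds)
    v_t = sum(sds[i] * sds[j] * correlations[i][j]
              for i in range(n) for j in range(n))
    numerator = sum(
        sds[i] * sds[j] * (subtest_alphas[i] if i == j else correlations[i][j])
        for i in range(n) for j in range(n)
    )
    return numerator / v_t

# Hypothetical two-cluster test: clusters correlate .50, each has alpha = .80.
sds = [1.0, 1.0]
correlations = [[1.0, 0.5], [0.5, 1.0]]
print(lumpy_test_coefficient(sds, correlations, [0.8, 0.8]))
```

Treating the same two clusters as two “items” of a composite would give α = 2(.5)/(1 + .5), about .67, so the estimate is higher whenever the cluster α's exceed the correlation between clusters.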

The relationships developed are summarized in Figure 3. α falls 
somewhere between the proportion of variance due to the first factor 
and the proportion due to all common factors. The blocks represent- 
ing “other common factors” and “item specifics” are small, for tests 
not containing clusters of items with distinctive content. 

FIGURE 3 

Certain Coefficients Related to the Composition of the Test Variance. 
(Labels surviving from the figure: parallel-forms coefficient; greatest 
split-half coefficient.) 


An index unrelated to test length. Conceptually, it seems as if 
the “homogeneity” or “internal consistency” of a test should be in- 
dependent of its length. A gallon of homogenized milk is no more 
homogeneous than a quart. a increases as the test is lengthened, and 
so to some extent do the Loevinger-Ferguson homogeneity indices. 
We propose to obtain an indication of interitem consistency by apply- 
ing the Spearman-Brown formula to a; , thereby estimating the mean 
correlation between items. The formula is entered with the recipro- 
cal of the number of items as the multiple of test length. The formula 
can be simplified to 








a 
T ij (eat) — (44) 
n+ (1—n)a 
or (cf. 24, p. 213 and 30, p. 387), 
: 1 V:— dV; 
T ij (est) — , . (45) 
n—1 Vi 


$\bar r_{ij(est)}$ ($r$ bar) is the correlation required, among items having equal
variances and equal covariances, to obtain a test of length $n$ having
common-factor concentration $\alpha$. $\bar r_{ij(est)}$ or its special case $\bar\phi$ for
dichotomously-scored items is recommended as an overall index of internal
consistency, if one is needed. It is independent of test length.
It is not, in my opinion, important for a test to have a high $\bar r$ if $\alpha$ is
high. Woodbury’s “standard length” (39) is an index of internal consistency
which can be derived from $\bar r_{ij}$ and has the same advantages
and limitations. $n_s$, the standard length, is the number of items which
yields an $\alpha$ of .50. Then

$$n_s = \frac{1 - \bar r_{ij}}{\bar r_{ij}}. \qquad (46)$$
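The two routes to the mean interitem correlation, and the standard length, can be checked against each other numerically. The sketch below is illustrative only: the function names and the four-item test are hypothetical, $\alpha$ is computed from the familiar form $\frac{n}{n-1}\left(1 - \sum V_i / V_t\right)$, and the standard length is obtained by solving the Spearman-Brown formula for the $n$ at which $\alpha$ = .50.

```python
def mean_r_from_alpha(alpha, n):
    """Formula (44): Spearman-Brown entered with 1/n as the length multiple."""
    return alpha / (n + (1 - n) * alpha)


def mean_r_from_variances(item_variances, total_variance):
    """Formula (45): (1 / (n - 1)) * (V_t - sum V_i) / sum V_i."""
    n = len(item_variances)
    sum_v = sum(item_variances)
    return (total_variance - sum_v) / ((n - 1) * sum_v)


def standard_length(mean_r):
    """Number of items giving alpha = .50 for a given mean intercorrelation."""
    return (1 - mean_r) / mean_r


# Hypothetical test: 4 items, unit variances, every intercorrelation .30.
n = 4
item_vars = [1.0] * n
v_total = sum(item_vars) + n * (n - 1) * 0.30
alpha = (n / (n - 1)) * (1 - sum(item_vars) / v_total)

print(mean_r_from_alpha(alpha, n))
print(mean_r_from_variances(item_vars, v_total))
print(standard_length(0.30))
```

Both functions recover the intercorrelation of .30 built into the example, whatever the test length chosen, which is the sense in which the index is independent of length.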


If $\bar r$ is high, $\alpha$ is high. But $\alpha$ may be high even when items have
small intercorrelations. If $\bar r$ is low, the test may be a smooth mixture
of items all having low intercorrelations. In this case, each item
would have some loading with the general factor and if the test is
long $\alpha$ could be high. Such items are illustrated by very difficult
psychophysical discriminations such as a series of near-threshold speech
signals to be interpreted; with enough of these items we have a highly
satisfactory measuring instrument. In fact, save for random error
of performance, it may be unidimensional. A low value of $\bar r$ may
instead indicate a lumpy test composed of discrete and homogeneous
subtests. Guttman (34, p. 176n.) describes a questionnaire of this
type. The concept of homogeneity has no particular meaning for a
“lumpy” test. It is logically meaningless to inquire whether a set of
ten measures of physical size plus ten intercorrelated vocabulary
items is more homogeneous than twenty slightly correlated biographical
questions. A high $\bar r$ is sufficient but not necessary evidence that
the test lacks important group factors. When $\bar r$ is low, only a study of
correlations among items or trial clusters of items shows whether the
test can be broken into more homogeneous subtests.

Comparison with the index of reproducibility. Guttman’s coefficient
of reproducibility has appeared to some reviewers (Loevinger,
28; Festinger, 13) as an ad hoc index with no mathematical rationale.
It may therefore be worthwhile to note that this coefficient can be
approximated by a mathematical form which makes clear what it
measures. The correlation of any two-choice item with a total score
on a test may be expressed as a phi coefficient, and this is common in
conventional item analysis. Guttman dichotomizes the test scores at
a cutting point selected by inspection of the data. We will get similar
results if we dichotomize scores at that point which cuts off the same
proportion of cases as pass the item under study. (Our $\phi_{it}$ will be
less in some cases than it would be if determined by Guttman’s inspection
procedure.) Simple substitution in Guttman’s definition (34, p.
117) leads to

$$R \doteq 1 - 2\,\overline{\sigma_i^2 (1 - \phi_{it})}, \qquad (47)$$

where the approximation is introduced by the difference in ways of
dichotomizing. The actual $R$ obtained by Guttman will be larger than
that from (47). For multiple-alternative items, a similar but more
complex formula involving the phi coefficient of the alternative with
the test is required to approximate Guttman’s result. $R$ is independent
of test length; if a Guttman scale is divided into equivalent portions,
the two halves will have the same $R$ as the original test. In this
respect, $R$ is most comparable to our $\bar r$. Both $\phi_{it}$ and $\bar r$ are low, so
long as items are unreliable or contain substantial specific factors.
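The substitution behind (47) can be verified for a single item. The sketch below is an illustration under stated assumptions, not Guttman's procedure: the total score is cut so that the same proportion falls above the cut as passes the item, errors are taken to be the two off-diagonal cells of the resulting fourfold table, and the cell proportions are hypothetical.

```python
from math import sqrt


def phi_from_cells(a, b, c, d):
    """Phi for a fourfold table of proportions.
    a: pass item, above cut    b: pass item, below cut
    c: fail item, above cut    d: fail item, below cut"""
    p_item = a + b
    p_score = a + c
    return (a * d - b * c) / sqrt(
        p_item * (1 - p_item) * p_score * (1 - p_score))


def reproducibility_one_item(p, phi):
    """One-item term of (47): 1 - 2 * sigma_i^2 * (1 - phi_it), sigma_i^2 = p*q."""
    return 1 - 2 * p * (1 - p) * (1 - phi)


# Hypothetical item passed by 60 per cent; score cut matched at 60 per cent.
a, b, c, d = 0.5, 0.1, 0.1, 0.3
phi = phi_from_cells(a, b, c, d)
r_direct = 1 - (b + c)                       # one minus the error proportion
r_formula = reproducibility_one_item(a + b, phi)
print(r_direct, r_formula)
```

With matched marginals the two off-diagonal cells are equal, each being $p q (1 - \phi)$, so the direct count of errors and the formula agree exactly; averaging over items gives the test-level value.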

5. Is the usefulness of $\alpha$ limited by properties of the phi coefficient
between items having unequal difficulties? The criticism has
been made, most vehemently by Loevinger (27), that $\alpha$ is a poor index
because, being based on product-moment correlations, it cannot
attain unity unless all items have distributions of the same shape.
For the pass/fail item, this requires that all $p_i$ be equal. The inference
is drawn that since the coefficient cannot reach unity for such
items, $\alpha$ and $\bar r$ do not properly represent homogeneity.

There are two ways of examining this criticism. The simpler is
empirical. The alleged limitation upon the product-moment coefficient
has no practical effect upon the coefficient, for items of the sort customarily
employed in psychological tests. To demonstrate this, we
consider the change in $\phi$ with changes in item difficulty. To hold constant
the relation between the “underlying traits,” we fix the tetrachoric
correlation. When the tetrachoric coefficient is .30, $p_i = .50$,
and $p_j$ ranges from .10 to .90, $\phi_{ij}$ ranges only from .14 to .19. Figure 4
shows the relation of $\phi_{ij}$ to $p_i$ and $p_j$ for three levels of correlation:
$r_{tet} = .30$, $r_{tet} = .50$, and $r_{tet} = .80$. The correlation among items in
psychometric tests is ordinarily below .30. For example, even for a
five-grade range of talent, the $\bar r_{ij}$ for the California Test of Mental
Maturity subtests range only from .13 to .25. That is, for tests having
the degree of item intercorrelation found in present practice, $\phi$
is very nearly constant over a wide range of item difficulties.
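The dependence of $\phi$ on item difficulty at a fixed tetrachoric correlation can be reproduced numerically. The following is a sketch under the usual assumption of bivariate-normal underlying traits; the function names are hypothetical, and the joint pass probability is obtained by Simpson's rule rather than by any method described in the text.

```python
from math import sqrt
from statistics import NormalDist

_ND = NormalDist()


def joint_pass_probability(p_i, p_j, r_tet, steps=2000):
    """P(both items passed) when the underlying traits are bivariate normal
    with correlation r_tet and an item is passed above its threshold."""
    h = _ND.inv_cdf(1 - p_i)          # threshold on the first trait
    k = _ND.inv_cdf(1 - p_j)          # threshold on the second trait
    s = sqrt(1 - r_tet ** 2)
    upper = max(h, 0.0) + 8.0         # effectively infinity
    width = (upper - h) / steps
    total = 0.0
    # Simpson's rule on the integral of pdf(x) * P(Y > k | X = x) over x > h.
    for m in range(steps + 1):
        x = h + m * width
        f = _ND.pdf(x) * (1 - _ND.cdf((k - r_tet * x) / s))
        weight = 1 if m in (0, steps) else (4 if m % 2 else 2)
        total += weight * f
    return total * width / 3


def phi_from_tetrachoric(p_i, p_j, r_tet):
    """Phi between two dichotomized items with the given pass rates."""
    p11 = joint_pass_probability(p_i, p_j, r_tet)
    return (p11 - p_i * p_j) / sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))


for p_j in (0.10, 0.20, 0.40, 0.50, 0.60, 0.80, 0.90):
    print(p_j, round(phi_from_tetrachoric(0.50, p_j, 0.30), 2))
```

With $p_i = .50$ and $r_{tet} = .30$ the printed values stay between .14 and .19 as $p_j$ runs from .10 to .90, which is the range quoted above.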


TABLE 7

Variation in Certain Indices of Interitem Consistency with Changes in Item
Difficulty (Tetrachoric Correlation Held Constant)

$p_i$          .50    .50    .50    .50    .50    .50    .50    .50     .50
$p_j$         →.00    .10    .20    .40    .50    .60    .80    .90   →1.00
$r_{tet}$      .30    .30    .30    .30    .30    .30    .30    .30     .30
$\phi_{ij}$   →.00    .14    .17    .19    .19    .19    .17    .14    →.00
$H_{ij}$     →1.00    .42    .34    .23    .19    .23    .34    .42   →1.00
Examining Loevinger’s proposed coefficient of homogeneity (29),

$$H_{ij} = \phi_{ij} / \phi_{ij(max)}, \qquad (48)$$

we find that it is markedly affected by variations in item difficulty.
One example is worked out in Table 7. As many investigators including
Loevinger have noted, Guttman’s $R$ is drastically affected by item
difficulty. For any single item, $R$ must be greater than $p_i$ or $q_i$, whichever
is greater. Evidently the indices of homogeneity which might replace
$\phi$ suffer more from the effects of differences in difficulty than
does the phi coefficient.
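Formula (48) is easy to evaluate once $\phi_{ij(max)}$ is written out. The sketch below is illustrative: the function names are hypothetical, and the expression used for $\phi_{ij(max)}$ is the standard maximum of phi for positively related items with fixed pass rates (everyone passing the harder item also passes the easier one), which the text itself does not spell out.

```python
from math import sqrt


def phi_max(p_i, p_j):
    """Largest phi attainable for two items with the given pass rates."""
    low, high = sorted((p_i, p_j))
    return sqrt(low * (1 - high) / (high * (1 - low)))


def loevinger_h(phi, p_i, p_j):
    """Formula (48): H_ij = phi_ij / phi_ij(max)."""
    return phi / phi_max(p_i, p_j)


# Phi values of .14, .17, and .19 paired with pass rates of .10, .20, and .50.
for p_j, phi in ((0.10, 0.14), (0.20, 0.17), (0.50, 0.19)):
    print(p_j, round(loevinger_h(phi, 0.50, p_j), 2))
```

While $\phi$ moves only from .14 to .19 across these difficulties, $H_{ij}$ roughly doubles, because the ceiling $\phi_{ij(max)}$ drops quickly as the pass rates diverge.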

Further evidence on the alleged limitation of $\alpha$ is obtained by
preparing four hypothetical 45-item tests. In each case, all $r_{ij(tet)}$ are
fixed at .30. Phi coefficients reflect both heterogeneity in content and
heterogeneity in difficulty. To assess the effect of the latter heterogeneity
upon $\bar\phi$ and $\alpha$, we compared one test of uniform item difficulty,
where all heterogeneity is in content, with another where “heterogeneity
due to difficulty” was allowed to enter. As Table 8 indicates,
even when extreme ranges of item difficulty are allowed, neither $\bar\phi$
nor $\alpha$ is affected in any practically important way. For tests where
item difficulties are higher, or correlations are lower, the effect would
be even more negligible.


TABLE 8

Comparison of $\bar\phi$ and $\alpha$ for Hypothetical 45-Item Tests With and Without
“Heterogeneity Due to Item Difficulty”

        Distribution of   Range of
Test    Difficulties      $p_i$          $\bar p_i$   $\bar\phi$   Diff.   $\alpha$    Diff.
A       Normal            .20 to .80     .50     .181    .011    .909    .005
A’      Peaked            .50            .50     .192            .914
B       Normal            .10 to .90     .50     .176    .016    .906    .008
B’      Peaked            .50            .50     .192            .914
C       Normal            .50 to .90     .70     .170    .011    .902    .007
C’      Peaked            .70            .70     .181            .909
D       Rectangular       .10 to .90     .50     .153    .039    .892    .022
D’      Peaked            .50            .50     .192            .914
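For the “peaked” tests, where every item has the same difficulty and hence the same variance, $\alpha$ follows from $\bar\phi$ by the ordinary Spearman-Brown step-up. The sketch below uses a hypothetical function name and is exact only under that equal-variance assumption; for the tests with spread difficulties it would be an approximation.

```python
def alpha_from_mean_phi(mean_phi, n):
    """Spearman-Brown step-up of the mean interitem phi to a test of n items."""
    return n * mean_phi / (1 + (n - 1) * mean_phi)


# Peaked 45-item tests of Table 8: mean phi of .192 (p = .50) and .181 (p = .70).
for mean_phi in (0.192, 0.181):
    print(mean_phi, round(alpha_from_mean_phi(mean_phi, 45), 3))
```

The two values agree with the $\alpha$ entries of .914 and .909 tabled for the peaked tests, and a drop of .039 in $\bar\phi$ would lower the stepped-up value by only a little more than .02.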


Still another small study leading to the same essential conclusion
was made by examining a “perfect scale,” where all $\phi_{ij}$ equal $\phi_{ij(max)}$.
Items were placed at five difficulty levels, the $p_i$ being .50, .58, .71,
.80, and .89. Then the correlations (phis) of items range from 1.00
(at same level) to .85 (highest between levels) to .36. In a test of
only five items, $\alpha$ reaches .86. This is the maximum $\alpha$ could have, for
this set of 5 items and specified $p_i$. As the number of items increases,
$\alpha$ rises toward 1.00. Thus, for 10 items, two at each level, $\alpha_{max} = .951$;
for 20 items, .977. It follows that even if items are much more homogeneous
in content than present tests and much freer from error,
the cumulative properties of covariance terms make the failure of all
$\phi$’s to reach unity of next-to-no importance. $\alpha_{max}$ would be lower if
difficulties range over the full scale, but the same principle holds. $\alpha$
is a good measure of common-factor concentration, for tests of reasonable
length, in spite of the fact that it falls short of 1.00 if items
vary in difficulty.

In the case of the perfect scale, of course, $\bar\phi$ does fall well short
of unity and for such tests it does not reflect the homogeneity in content.
From the five-item case just considered, $\bar\phi$ is .54.

The second way to analyze this criticism is to examine the nature
of redundancy (using a term from Shannon’s information theory,
32). If two items repeat the same information, they are totally redundant.
Thus, if one item divides people 50/50, and the second item
does also, the two items always placing exactly the same people together,
the second item gives no new information about individual
differences. (Cf. Tucker, 36.) Suppose, though, that the second item
is passed by 60 per cent of the subjects. Even if $r_{ij(tet)} = 1.00$, this
second item conveys new information because it discriminates among
the fifty people who failed the first item. A five-item test where all
items have perfect tetrachoric intercorrelations, and the $p_i$ are .40,
.45, .50, .55, .60, is perfectly homogeneous (à la Guttman, Loevinger,
et al.). So is a ten-item test composed of these items plus five others
whose $p$’s are .30, .35, .65, .70, .75. The two tests are not equivalent
in measuring power, however; the second makes a much greater
number of discriminations. Because there is less redundancy, the
longer test has a lower $\bar\phi$.
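The difference in measuring power between the two perfectly homogeneous tests can be made concrete by counting the score groups each one separates. The sketch below is an illustration, not a computation from the text: it assumes a perfect scale (anyone passing a harder item passes every easier one), takes the five added items to have pass rates of .30, .35, .65, .70, and .75, and uses Shannon's entropy in bits as the measure of information; the function names are hypothetical.

```python
from math import log2


def score_proportions(pass_rates):
    """Proportion of people at each total score on a perfect scale,
    listed from the highest score (all items passed) to zero."""
    rates = sorted(pass_rates)            # hardest item first
    cuts = [0.0] + rates + [1.0]
    return [cuts[i + 1] - cuts[i] for i in range(len(cuts) - 1)]


def bits(proportions):
    """Shannon entropy, in bits, of a set of category proportions."""
    return -sum(w * log2(w) for w in proportions if w > 0)


five = [0.40, 0.45, 0.50, 0.55, 0.60]
ten = five + [0.30, 0.35, 0.65, 0.70, 0.75]

for name, rates in (("five-item", five), ("ten-item", ten)):
    groups = score_proportions(rates)
    print(name, len(groups), "score groups,", round(bits(groups), 2), "bits")
```

The ten-item test separates eleven score groups against six for the five-item test, and its score distribution carries correspondingly more information, even though both tests are equally homogeneous in content.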

From the viewpoint of information theory, we should be equally 
concerned with heterogeneity in content and heterogeneity in difficulty. 
We get one bit of information when we place the person as above the 
mean in (say) pitch discrimination. Now with another item or set 
of items, we might place him relative to the mean in visual acuity. 
The two tests together place him in one of four categories. If our sec- 
ond test had been a further measure of pitch, placing the subject 
above or below the 75th percentile, then the two tests would have 
placed him in one of four categories. Either set of tests gives the 
same amount of information. Which information we most want depends
on practical considerations.

The phi coefficient reports whether a second item gives new information
that the first does not. Then a tetrachoric $r$ must be computed
to determine if the new information relates to a new content
dimension or to a finer discrimination on the same content dimension.
If the phi coefficient between true scores is 1.00, redundancy is com- 
plete and there is no new information. Redundancy is desirable when 
accuracy of a single item is low. To test whether men can hear a 10- 
cycle difference, the best way is to use a large number of items of 
just that difficulty. Such items usually also discriminate to some de- 
gree at other points on the scale, but cannot give information about 
ability at the 5-cycle level if a single item is extremely reliable. 
With very accurate items a pitch test which is not homogeneous will 
be better for differentiation all along the scale. The “factors” found 
by Ferguson (11) due to the higher correlation (redundancy) of items 
with equal difficulty need not be regarded as artifacts (38).* These 
“difficulty factors” are factors on which the test gives information 
and on which the tester may well want information. They are not 
“content factors,” but they must be considered in test analysis. For 
example, if one regards pitch tests in this light, it is seen that a test 
containing 5-cycle items, 10-cycle items, and 15-cycle items will be 
slightly influenced by undesired factors, when the criterion requires 
discrimination only at the 15-cycle level. (Problems of this type oc- 
cur in validating tests for selecting military personnel using detec- 
tion apparatus). One would maximize the loading in the test of the 
group factor among 15-cycle items, to maximize validity. This factor 
is of course a mathematical factor, and not a property of the auditory 
machinery. While the mathematics is not clear, it seems very likely 
that the group factors found among phi coefficients are interchange- 
able with Guttman’s “components of scale analysis” to which he gives 
serious psychological interpretation. 

From this point of view, the phi coefficient which tells when items
do and do not duplicate each other is a better index just because it
does not reach unity for items of unequal difficulty. Phi and $r_{tet}$ are
both useful in test analysis. Brogden (1, pp. 199, 201) makes a similar
point, although approaching the problem from another tack.


*It is not necessary, as Ferguson seems to think, for difficulty factors to 
emerge if product-moment correlations are used with multi-category variates. 
On a priori grounds, difficulty factors will appear only if the shapes of the dis- 
tributions of the variates are different. In Ferguson’s data it appears likely that 
the hardest and easiest tests were skewed in opposite directions. 
Implications for Test Design

In view of the relations detailed above, we find it unnecessary to 
create homogeneous scales such as Guttman, Loevinger, and others 
have urged. 

It is true that a test where all items represent the same content 
factor with no error of measurement is maximally interpretable. 
Everyone attaining the same score would mark items in the same way. 
Yet the question we really wish to ask is whether the individual dif- 
ferences in test score are attributable to the first factor within the 
test. If a large proportion of the score variance relates to this factor, 
the residue due to specific characteristics of the items little handicaps 
interpretability. It has been shown that a high first-factor saturation
indicated by a high $\alpha$ can be attained by cumulating many items which
have low correlations. The standard proposed by Ferguson, Loevinger,
and Guttman is unreasonably severe, since it would rule out tests
which do have high first-factor concentrations.

These writers seem to wish to infer the person’s score on each 
item from his total score. This appears unimportant, but even if it 
were important, the interest would attach to predicting his true stand- 
ing on the item, not his fallible obtained score. For the unreliable 
items used in psychological and educational tests, the aim of Guttman 
et al. will not be approached in practice. Perhaps sociological data 
have such greater reliability that prediction of obtained scores is tan- 
tamount to predicting true scores. 

Increasing interpretability by lengthening a test is not without 
its disadvantages. Using more and more time to get at the same in- 
formation employs the principle of redundancy (32). When a mes- 
sage is repeated over and over, it is easier to infer the true message 
even when there is substantial interference (item unreliability). But 
the more you repeat messages already transmitted, the less time is 
allowed for conveying other information. A set of redundant items 
can carry much less information than a set of independent items. In 
other words, when we lengthen certain tests or subtests to make their 
scores more interpretable, we sacrifice the possibility of obtaining 
separate measures of additional factors in the same time. 

From the viewpoint of both interpretability and efficient prediction
of criteria, the smallest element on which a score is obtained
should be a set of items having a substantial $\alpha$ and not capable of
division into discrete item clusters which themselves have high $\alpha$.
Such separately interpretable tests can sometimes be combined into
an interpretable composite, as in the case of the PMA tests. Although
it is believed that the test designer should seek interitem consistency,
and judge the effectiveness of his efforts by the coefficient $\alpha$, the pure
scale should not be viewed as an ideal. It should be remembered that
Tucker (36) and Brogden (1) have demonstrated that increases in 
internal consistency may lead to decreases in the product-moment 
validity coefficient when the shape of the test-score distribution dif- 
fers from that of the criterion distribution. 


Summary 


1. Formulas for split-half coefficients of equivalence are com- 
pared, and those of Rulon and Guttman are advocated for practical 
use rather than the Spearman-Brown formula. 


2. $\alpha$, the general formula of which Kuder-Richardson formula
20 is a special case, is found to have the following important meanings:

(a) $\alpha$ is the mean of all possible split-half coefficients.

(b) $\alpha$ is the value expected when two random samples of
items from a pool like those in the given test are correlated.

(c) $\alpha$ is a lower bound for the coefficient of precision (the
instantaneous accuracy of this test with these particular items).
$\alpha$ is also a lower bound for coefficients of equivalence obtained by
simultaneous administration of two tests having matched items. But for
reasonably long tests not divisible into a few factorially-distinct subtests,
$\alpha$ is nearly equal to “parallel-split” and “parallel-forms” coefficients
of equivalence.*

(d) $\alpha$ estimates, and is a lower bound to, the proportion of
test variance attributable to common factors among the items. That
is, it is an index of common-factor concentration. This index serves
purposes claimed for indices of homogeneity. $\alpha$ may be applied by a
modified technique to determine the common-factor concentration
among a battery of subtests.

(e) $\alpha$ is an upper bound to the concentration in the test of
the first factor among the items. For reasonably long tests not divisible
into a few factorially-distinct subtests, $\alpha$ is very little greater
than the exact proportion of variance due to the first factor.

*W. G. Madow suggests that the amount of disagreement between two random
or two planned samples of items from a larger population of items could be
anticipated from sampling theory. The person’s score on a test is a sample mean,
intended to estimate the population mean or “true score” over all items. The
variance of such a mean from one sample to another decreases rapidly as the sample
is enlarged by lengthening the test, whether samples are drawn at random or are
drawn after stratifying the universe as to difficulty and content. The conditions
under which the random splits correlate about as highly as parallel splits are those
in which stratified sampling has comparatively little advantage. Madow’s comment
has implications also for the preparation of comparable forms of tests and for
developing objective methods of selecting a sample of items to represent a larger
set of items so that the variance of the difference between the score based on the
sample and the score based on the universe of items is as small as possible.


3. Parallel-splits yield coefficients little larger than random
splits, unless tests contain large blocks of items representing group
factors. For such tests, $\alpha$ computed for separate blocks and combined
by a special formula gives a satisfactory estimate of first-factor
concentration.


4. Interpretability of a test score is enhanced if the score has a
high first-factor concentration. A high $\alpha$ is therefore to be desired,
but a test need not approach a perfect scale to be interpretable. Items
with quite low intercorrelations can yield an interpretable scale.


5. A coefficient $\bar r_{ij}$ (or $\bar\phi_{ij}$) is derived which is the intercorrelation
required, among items with equal intercorrelations and variances,
to reproduce a test of $n$ items having common-factor concentration $\alpha$.
$\phi$, as a measure of item interdependence, draws attention to heterogeneity
in both difficulty and content factors. Heterogeneity in test
difficulty merits the attention of the test designer, since the validity of
the test may be increased by capitalizing on “difficulty factors” present
in the criterion.

6. To obtain subtest scores for interpretation or to be weighted
in an empirical composite, the ideal set of items is one having a
substantial $\alpha$ and not further divisible into a few discrete smaller blocks
of items.
Formulas are developed for estimating a point-biserial r or a
tetrachoric r from an obtained phi coefficient. The estimate of a
tetrachoric r, which is called $r_\phi$, is shown to be equivalent to that
obtained from first-order use of the tetrachoric r series. A tabulation
is made of corrections needed to make $r_\phi$ equivalent numerically
to the tetrachoric r. In spite of its greater generality than estimates
of tetrachoric r by previous methods, there are limitations, which
are pointed out.


In recent years there has been an increasing recognition of the
utility of the Boas-Yule phi coefficient as an index of correlation. This
is evident in the greater attention it receives in general textbooks on
statistical methods (2, 3, 5). Some of this attention is due to the
increased importance of categorical data and the more extensive, rigorous
use of them in research in the social sciences. Some is consistent
with the view that responses to test items should be regarded operationally
as categorical data calling for corresponding treatment in
item analysis.


Needs for Estimation of Other Indices of Correlation from Phi 


We first mention some examples of occasions for estimating other 
types of correlation coefficients when only a phi coefficient is known. 

There are a number of circumstances under which this happens. 
Data in which both variables come to us dichotomized yield a phi co- 
efficient with little computation. We may decide that both of the vari- 
ables are actually continuous, the regression is rectilinear, and dis- 
tributions of continuous measurements of an appropriate type would 
be normal in the population. A tetrachoric correlation coefficient 
would be obviously called for, as the equivalent of the Pearson r. But 
perhaps we have already computed a phi coefficient, or we prefer the 
latter for computational reasons. Can we estimate the tetrachoric r 
from the phi coefficient with sufficiently tolerable accuracy to justify 
its use in this situation? 
Another situation arises in connection with the computation of
intercorrelations by means of the IBM tabulator. The distributions
of some of the variables may be so skewed or otherwise irregular that
Pearson r’s would be out of the question. Such irregularities are
likely to introduce non-linearity or heteroscedasticity or both (3, p.
171). Yet, it may well be assumed that the population distributions
on ideal scales would be normal. One practical solution would be to
code the scores in such distributions as 1 (above-median) and 0
(below-median). The product-moment correlation of two such variables
gives a phi coefficient. Yet it is a value equivalent to a Pearson r that
we want, to represent the genuine degree of relationship.

In item-analysis studies, it is quite common procedure, for convenience,
to dichotomize the total-score distribution and to correlate
with it the pass-fail dichotomy for each item. From the four-fold
table we may compute either a tetrachoric r or a phi coefficient. The
tetrachoric r is called for if we want to know how well the thing (or
things) measured by the item correlates with the thing (or things)
measured by the total score. If we want to know how well we can
predict total score from the item dichotomy, however, the coefficient
we want is a point-biserial r (3, p. 499). In other words the
point-biserial r is the realistic one for determining how well the item
functions in measurement. The realistic correlation coefficient for
correlations between items would be the phi coefficient, if we want to know
how well we could predict success on one item from success on another
item. If we wanted to make a factor analysis based upon the
intercorrelations of items, however, we would want tetrachoric r’s.

Having computed phi coefficients, either between items and total
scores or between pairs of items, can we make tolerably accurate estimates
of either point-biserial r’s or tetrachoric r’s? If we can, the
great economy of computing phi in the item-analysis situation gives
it considerable appeal (3, p. 503). From it we could derive an estimate
of either the point-biserial r or the tetrachoric r.


Common Correction Practices 


The procedures for estimating a tetrachoric r from phi as given
in the textbooks have been unsatisfactory. The problem has been
inadequately approached from the point of view of correcting a coefficient
of correlation for coarse grouping. Peatman (5) recommends
the division of $\phi$ by the constant .798 to estimate a point-biserial and
by the constant .637 (which is $.798^2$) to estimate a tetrachoric r. Guilford
(3) makes a similar recommendation, with serious reservations.
These constants are called for under the general principle of correction
for coarse grouping when the category values are the means of
the cases in the categories. It will be shown that these constants
actually apply only under one set of conditions. Edwards (2) follows
the approximation procedure given by Camp (1), which varies the
correction constant according to the size of p, the proportion of cases
in the category with the larger marginal frequency.


A General Correction Formula 


We will now proceed to develop a formula for the estimation of 
a Pearson r from a phi coefficient and to show its relation to the se- 
ries for a tetrachoric r and its limits of accuracy. The basis will be 
the principle of correcting r for coarse grouping when there are two 
categories in X and Y, and when the index numbers are the means of 
values in segments of the distributions. It will be necessary to assume 
normal, continuous distributions in X and Y. 

According to the usual mathematical development, which is giv- 
en by Peters and Van Voorhis (6, p. 395), the correction factor for 
either variable is the standard deviation of the grouped values. Let us 
assume a unit normal distribution for X and for Y. Some needed 
parameters of either distribution will be: 


p’ = the proportion of the N cases in the upper category of X,
p  = the similar proportion on Y,
q’ = the proportion of the N cases in the lower category on X,
q  = the similar proportion on Y,
y’ = the ordinate at the point of division on distribution X, and
y  = the similar ordinate in distribution Y.


It is necessary, next, to express the standard deviations in terms
of these parameters. With the distribution normal, the means of the
two segments are given by the two ratios, y/p and y/q, for the distribution
on Y. The standard deviation of the distribution on Y is
given by the equation

$$\sigma_y = \frac{y}{\sqrt{pq}}.$$

Applying both corrections to phi,

$$r_\phi = \frac{\phi}{\left(\dfrac{y'}{\sqrt{p'q'}}\right)\left(\dfrac{y}{\sqrt{pq}}\right)}, \qquad (3)$$

in which $r_\phi$ is an estimate of a Pearson r from a known $\phi$ coefficient.
It will, therefore, be seen that the correction factors are functions of
p and p’, the marginal proportions. These functions are at a minimum
when p = .5. Under this condition, a correction factor equals
.798, and only under the condition that p = p’ = .5 will the joint
correction factor be .637.

For greater convenience, it is better in practice to use the reciprocals
of the standard deviations. Equation (3) then becomes

$$r_\phi = \phi \left(\frac{\sqrt{p'q'}}{y'}\right)\left(\frac{\sqrt{pq}}{y}\right). \qquad (4)$$

Tabled values of these correction factors are available (3, p. 614).
A point-biserial r would be estimated from phi by the use of
only one of these multiplying factors, since there is only one genuinely
continuous distribution. A Pearson r could be estimated from a
computed point-biserial r by multiplying the latter by one of these
same factors. As a matter of fact, such a factor was previously shown
to be the ratio of proportionality between a biserial r and a
point-biserial r computed from the same data (3, p. 331), by relating the
two formulas for those two coefficients.


TABLE 1
Correction Factors for Estimating $r_\phi$ from $\phi$ for
Different Combinations of p and p’

                                      p
  p’     √(pq)/y      .5       .6       .7       .8       .9
  .5      1.253     1.570    1.589    1.651    1.791    2.141
  .6      1.268              1.608    1.671    1.812    2.167
  .7      1.318                       1.737    1.883    2.252
  .8      1.429                                2.042    2.442
  .9      1.709                                         2.921

To give some conception of how much error is introduced in using
the constant .637 regardless of the values of p and p’, Table 1 is presented.
Here the correction factors of formula (4) corresponding
to p values of .5, .6, .7, .8, and .9 are given, also the products of
those factors for various combinations of p and p’. For the combination
of p = .5 and p’ = .5, the factor is 1.570, which is the reciprocal
of .637. For the most extreme case in the table, when p = .9 and
p’ = .9, the multiplier equals 2.921, which is almost twice that for
the minimum product. The procedure suggested by Camp takes into
account only one of the p values, whichever is larger. On that basis,
his corrections vary much less with p than do those by the method
proposed here.
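The single-variable factors of Table 1 can be regenerated directly from the normal curve. The sketch below is an illustration with hypothetical function names; it assumes, as the derivation does, a unit normal distribution underlying each dichotomy, and uses the standard-library `statistics.NormalDist` for the ordinate.

```python
from math import sqrt
from statistics import NormalDist


def correction_factor(p):
    """sqrt(pq) / y for a unit normal variable cut so that a proportion p
    falls in one category; y is the ordinate at the point of division."""
    nd = NormalDist()
    y = nd.pdf(nd.inv_cdf(p))
    return sqrt(p * (1 - p)) / y


def r_phi_from_phi(phi, p, p_prime):
    """Formula (4): phi multiplied by the factor for each variable."""
    return phi * correction_factor(p) * correction_factor(p_prime)


for p in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(p, round(correction_factor(p), 3))
```

The printed factors match the first column of Table 1. Using `correction_factor` once rather than twice gives the corresponding estimate of a point-biserial r, since only one of the two distributions is then treated as continuous.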

Let us substitute in equation (4) the usual expression for $\phi$ that
is used in its computation. We then have

We thus arrive at what seems to be a new equation for estimating
directly from a fourfold table of proportions the Pearson coefficient
of correlation.

If we start with the fourfold table of frequencies instead of proportions,
formula (5) becomes

$$r_\phi = \frac{ad - bc}{yy'N^2}. \qquad (6)$$

Equation (6) will look familiar to many, for the right-hand term is
precisely the left-hand side of the text-book infinite series equation
for tetrachoric r (3, p. 334).*

$$\frac{ad - bc}{yy'N^2} = r_t + r_t^2\,\frac{hk}{2} + r_t^3\,\frac{(h^2 - 1)(k^2 - 1)}{6} + \cdots$$

If the tetrachoric r is the best estimate of the Pearson r that can be
obtained from a 2 × 2 table, then the amount of error involved in
using the formulas given here in estimating the Pearson r is indicated
by the contribution of all terms of the infinite series with powers higher
than the first. For small and even moderate values of $r_t$, there is
probably little error in so doing. When $r_t$ becomes large, however,
the error becomes serious. We next turn our attention to estimating
this discrepancy under different conditions.
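Formula (6) can be applied to a fourfold table of frequencies in a few lines. The sketch below is illustrative and uses a hypothetical function name; a and d are taken as the concordant cells and b and c as the discordant ones, and the ordinates are computed from the exact marginal proportions rather than read from a table, so the last digit may differ slightly from hand computation. The frequencies are those of the illustrative problem in Table 2.

```python
from statistics import NormalDist


def r_phi_from_frequencies(a, b, c, d):
    """Formula (6): (ad - bc) / (y * y' * N^2) for a fourfold table.
    a: upper/upper   b: upper/lower   c: lower/upper   d: lower/lower"""
    n = a + b + c + d
    nd = NormalDist()
    p = (a + b) / n                  # upper category of the first variable
    p_prime = (a + c) / n            # upper category of the second variable
    y = nd.pdf(nd.inv_cdf(p))
    y_prime = nd.pdf(nd.inv_cdf(p_prime))
    return (a * d - b * c) / (y * y_prime * n ** 2)


# Frequencies of Table 2: Yes-Yes 374, Yes-No 167, No-Yes 186, No-No 203.
print(round(r_phi_from_frequencies(374, 167, 186, 203), 3))
```

Dividing every cell by N turns the same expression into its proportion form, in which the numerator is the difference of the cross-products of cell proportions and the denominator is simply the product of the two ordinates.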











Refinement of the Estimation 


This error, say $\epsilon = r_t - r_\phi$, can be conveniently obtained by
iterative processes if $r_\phi$ is less than .6, and the proportions in the
dichotomies are not too far from .5. To illustrate the iteration let us
solve for $r_t$ for the data in Table 2.


TABLE 2
An Illustrative Problem

                          Question I
                      No      Yes     Total   Proportion
Question II   Yes     167     374      541       .582
              No      203     186      389       .418
              Total   370     560      930
              Prop.   .398    .602





*We are indebted to Mr. Russel F. Green for pointing out this identity. 
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The required infinite series* is given by the equation 


hk h? —1)(k?—1 
rantne (> )tno A cae (7) 





6 


where h and k are the z-scores of the points of division determining 
the dichotomies. In practice the coefficients for the polynomial in 7; 
are most conveniently obtained from Pearson’s Tables (4). 

For the data of Table 2, then, equation (7) becomes

$$ \frac{(374)(203) - (167)(186)}{(.3905)(.3858)(930)^2} = r_t + \frac{(.2070)(.2585)}{2} \, r_t^2 + \frac{(.9332)(.9572)}{6} \, r_t^3 + \cdots $$

$$ \frac{44860}{(.1507)(864900)} = .3443 = r_t + .02675 \, r_t^2 + .1489 \, r_t^3 + \cdots $$

$$ r_t = .3443 - .02675 \, r_t^2 - .1489 \, r_t^3 - .01937 \, r_t^4 - .0595 \, r_t^5 - .0155 \, r_t^6 $$

where h = .2070 and k = .2585.

As a first approximation to $r_t$, take $r_1 = .3443$ and substitute on
the right. Take the new result $r_2$ and again substitute on the right,
obtaining $r_3$, etc.

$$ r_2 = .3443 - .0032 - .0061 - .0003 - .0003 - \cdots = .3344 $$
$$ r_3 = .3443 - .0030 - .0056 - .0002 - .0003 - \cdots = .3352 $$
$$ r_4 = .3443 - .0030 - .0056 - .0002 - .0003 - \cdots = .3352 . $$

We may, therefore, take $r_t = .335$. Then $.344 + \epsilon = .335$ and
$\epsilon = -.009$.
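The iteration can be reproduced with a short Python sketch. It assumes that the coefficient of the j-th power of $r_t$ in equation (7) is $He_{j-1}(h)\,He_{j-1}(k)/j!$, where $He_n$ is the probabilists' Hermite polynomial. This is the standard form of the tetrachoric series and agrees with the first terms printed above, but the function names and the six-term truncation are choices made here for illustration, not part of the original procedure.

```python
from math import factorial


def hermite(n, x):
    """Probabilists' Hermite polynomial He_n(x) by the usual recurrence."""
    prev, cur = 1.0, x
    if n == 0:
        return prev
    for m in range(1, n):
        # He_{m+1}(x) = x * He_m(x) - m * He_{m-1}(x)
        prev, cur = cur, x * cur - m * prev
    return cur


def series_coefficients(h, k, n_terms):
    """Coefficients c_1..c_n of r_phi = sum c_j * r_t**j in equation (7)."""
    return [hermite(j - 1, h) * hermite(j - 1, k) / factorial(j)
            for j in range(1, n_terms + 1)]


def solve_r_t(r_phi, h, k, n_terms=6, tol=1e-6, max_iter=100):
    """Fixed-point iteration r <- r_phi - sum_{j>=2} c_j * r**j."""
    coefs = series_coefficients(h, k, n_terms)
    r = r_phi  # first approximation, as in the text
    for _ in range(max_iter):
        r_new = r_phi - sum(c * r ** j
                            for j, c in enumerate(coefs[1:], start=2))
        if abs(r_new - r) < tol:
            return r_new
        r = r_new
    raise ArithmeticError("iteration did not converge")


r_t = solve_r_t(0.3443, h=0.2070, k=0.2585)
print(round(r_t, 3), round(r_t - 0.3443, 3))
```

Substituting the previous value on the right at each step is the same simple fixed-point scheme used in the hand calculation; it settles quickly here because the higher-order terms are small when $r_\phi$ is moderate.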

In a manner similar to the above an epsilon table can be set up
for values of $r_\phi$ and convenient ranges of proportions in the two
variables. In setting up these tables we clearly have the direction of
either variable at our disposal.

We use this freedom to restrict one set of proportions to be less
than .5, and to insist that $r_\phi$ be positive. The second variable will
then have proportions ranging from 0 to 1. If in any application a
negative $r_\phi$ occurs it is only necessary to consider the corresponding
positive value in relation to the complementary proportion $1 - p$.
(See Table 4).

*For later reference Pearson’s form of this infinite series is

$$ \frac{d}{N} = \tau_0(h)\tau_0(k) + \tau_1(h)\tau_1(k) \, r_t + \tau_2(h)\tau_2(k) \, r_t^2 + \tau_3(h)\tau_3(k) \, r_t^3 + \cdots $$

The use of the tables is shown by the following examples. Assume
that for a given 4-cell table $r_\phi = .6$, and that the corresponding
proportions in the variables are .3 and .5. Entering the section of the
tables corresponding to $r_\phi = .6$ with these proportions, we find
$\epsilon = -.02$. Hence $r_t = .58$.

For a second set of data assume $r_\phi = -.8$, $p = .3$, $p' = .7$. This
is equivalent to using $r_\phi = .8$, $p = .3$, and $p' = .3$. Thus
$\epsilon = -.12$, and tabular $r_t = .68$. Hence, for the data $r_t = -.68$.
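The reflection rule used in the second example can be written out as a small Python helper. This is a sketch under assumptions: it supposes that a negative $r_\phi$ is handled by reversing the direction of the second variable, and that a first proportion above .5 is handled by reversing both variables. The function name is hypothetical, and the value of $\epsilon$ is simply the one quoted in the text, not a table lookup.

```python
def fold_for_table(r_phi, p, p_prime):
    """Return (|r_phi|, p, p', sign) in the orientation an epsilon table expects."""
    sign = 1
    if r_phi < 0:
        # Reversing the direction of the second variable flips the sign of
        # r_phi and replaces p' by its complement.
        r_phi, p_prime, sign = -r_phi, 1 - p_prime, -1
    if p > 0.5:
        # Reversing both variables leaves r_phi unchanged (assumed rule).
        p, p_prime = 1 - p, 1 - p_prime
    return r_phi, p, p_prime, sign


r, p, p_prime, sign = fold_for_table(-0.8, 0.3, 0.7)
epsilon = -0.12  # value the text reads from Table 4 for r_phi = .8, p = p' = .3
print(round(sign * (r + epsilon), 2))
```

The sign is restored only at the end, which mirrors the text: the table is entered with the positive value and the tabular $r_t$ is then given the sign of the data.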


For values of $r_\phi$ nearer 1 than .6, the iteration process described
becomes impractical as the method must be repeated more and more
often with twenty or more terms of the series becoming significant.
For these larger values, then, we make use of Pearson’s Bivariate
Tables (4), employing a graphical procedure to obtain results
accurate to one significant figure.


FIGURE 1





Pearson’s tables supply the quantity d/N for specified h, k, and
$r_t$, as shown in Fig. 1. Specifically, h and k run by tenths from 0 to
2.6 and $r_t$ runs by intervals of .05 from 1.00 to -1.00. Since, in
Pearson’s notation,

$$ r_\phi = \frac{d/N - \tau_0(h)\tau_0(k)}{\tau_1(h)\tau_1(k)} , $$

it is possible to graph $r_\phi$ as a function of chosen values of $r_t$
for fixed h and k. Then, for specified $r_\phi$ the graph can be read in
reverse to yield $r_t$, a process which yields $\epsilon = r_t - r_\phi$.

We now illustrate the graphical procedure used for the special
case h = k = 0; that is, for proportions of .5 in both variables. In
particular, then, Pearson’s equation becomes

$$ r_\phi = \frac{d/N - \tau_0(0)\tau_0(0)}{[\tau_1(0)]^2} , $$

for which $[\tau_1(0)]^2 = .1592$ and $[\tau_0(0)]^2 = .25$. Hence,
$r_\phi = 6.28 \, d/N - 1.571$. With the aid of the latter equation and
Pearson’s tables we can now set up the desired correspondence between
chosen $r_t$ and $r_\phi$ in Table 3.




















TABLE 3
$r_\phi$ as a Function of $r_t$ for $p = p' = .5$

$r_t$        .1      .2      .3      .4      .5      .6      .7      .8      .9
d/N         .266    .282    .298    .315    .333    .352    .373    .398    .428
6.28 d/N   1.671   1.772   1.876   1.982   2.09    2.21    2.35    2.50    2.69
$r_\phi$    .1002   .201    .305    .412    .524    .644    .775    .927   1.120








FIGURE 2
$r_\phi$ as a Function of $r_t$
The next step in the process is to graph $r_\phi$ as a function of $r_t$ as
shown in Fig. 2.

As previously indicated the graph can now be used in reverse to
obtain for Table 4 the $r_t$’s, for specified $r_\phi$’s, which will yield the
desired $\epsilon$’s.

For clarity, let us consider two instances of this process. From
lines a and b in Fig. 2 it can be seen that if $r_\phi = .8$, then $r_t = .72$.
Hence $\epsilon = -.08$. As a second illustration let $r_\phi = .5$. From lines c
and d in Fig. 2, $r_t = .48$. Hence $\epsilon = .48 - .50 = -.02$.

It is interesting to note that the graphical method becomes
inaccurate for small $\epsilon$, exactly the situation where iteration works best.
Thus the two methods complement each other very well, with excellent
checks over the middle range of $r_\phi$.
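For the special case h = k = 0 the graphical readings can be cross-checked in closed form. The check rests on Sheppard's classical result for a bivariate normal surface dichotomized at both medians, $d/N = 1/4 + \arcsin(r_t)/(2\pi)$, which gives $r_\phi = \arcsin(r_t)$ and hence $r_t = \sin(r_\phi)$. That identity is not stated in the text and is introduced here only as a check. A minimal Python sketch, with a hypothetical function name:

```python
from math import asin, pi, sin

# Rebuild the rows of Table 3 for p = p' = .5.
for r_t in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    d_over_n = 0.25 + asin(r_t) / (2 * pi)   # Sheppard's formula
    r_phi = 2 * pi * d_over_n - pi / 2       # equals asin(r_t)
    print(f"{r_t:.1f}  {d_over_n:.3f}  {r_phi:.3f}")


def epsilon_at_medians(r_phi):
    """epsilon = r_t - r_phi when both dichotomies fall at the median."""
    return sin(r_phi) - r_phi


print(round(epsilon_at_medians(0.8), 2), round(epsilon_at_medians(0.5), 2))
```

The two printed corrections agree with the values read from lines a to d of Fig. 2.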

Examination of Table 4 will show the limitations to the estimation
of tetrachoric $r$ by means of formulas (4) and (6). Only for
small and moderate values of $r$ can reasonably accurate estimates be
made, and only when $p$ and $p'$ are not extreme. The many vacant
cells in the table indicate that estimates for higher values of $r$ and
for the more extreme values of $p$ and $p'$ are indeterminate. Some of the
larger correction values that are near those vacant areas should be
used with hesitation. In spite of these limitations, however, there is
much utility in the complete procedure given here. Most of the
correlations obtained in the social sciences are moderate or low. The
wise investigator also attempts to make divisions near the medians
when he resorts to artificial dichotomies. Conditions are, therefore,
usually favorable for utilizing these estimation procedures well within
the limitations mentioned.


Illustrative Examples 
In the interest of clarification we now present specific examples
of the general procedures which have been developed.

The phi coefficient for the data of Table 2 can be obtained from
the standard formula (3, p. 341):

$$ \phi = \frac{ad - bc}{[(a+b)(a+c)(b+d)(c+d)]^{1/2}} . $$

Substituting and extracting square roots,

$$ \phi = \frac{44860}{(23.26)(23.66)(19.24)(19.72)} = .215 . $$

By use of formula (4),
TABLE 4
Values of $\epsilon$ for Different Combinations of $r_\phi$, $p$, and $p'$*
                     $p'$
.9  .8  .7  .6  .5  .4  .3  .2  .1    $p$    $r_\phi$
.00 01 00 —.01 .00 .00 .00 .00 -00 5 
02 1 .00 .00 .00 .00 00 —.01 01 4 
.00 .00 01 .00 .00 00 —.01 —01 —.01 3 2 
.03 .02 .00 01 00 —.01 —.01 —.01 —.02 ay 
02 02 01 01 .00 01 —.01 —.02 —.02 a 
02 —01 —.01 —.01 —.01 —01 —.01 .00 01 m3) 
.03 01 O01 00 —0L —.02 —.02 —.02 —.038 A 
.06 .05 .02 04 —01 —.02 —03 —04 —.03 3 4 
2 Zz 04 01 00 —.02 —04 —05 —.06 pe 
2 BA .09 04 01 —.03 —.03 —.06 —.08 a! : 
—.01 00 —01 —02 —02 —02 —.02 —01 —.02 oO 
.08 .02 00 —.03 —.02 —.038 —.03 —.08 —.05 A 
iu eI .03 01 —02 —03 —.05 —.06 —.04 3 5 
2 .09 03 —.01 —.03 —.06 —.07 —.08 py. 
Bs .08 02 —.05 —.04 —08 —1 oe 
-04 01 —03 —.04 —.08 —.03 —.02 —.01 .04 5 
‘a .06 01 —.08 —03 —.05 —05 —04 —.07 4 
Al .06 01 —.02 —.05 —07 —.07 —.08 33 6 
4 05 —01 —.04 —07 —09 —1. 2 
od 04 —.07 —.08 —1 —.2 ok 
Ft 00 —.03 —04 —.05 —.05 —.04 —.02 07 m5) 
me 01 —.04 —.05 —.07 —.07 —.06 -00 4 
ok 01 —04 —07 —1 —1 —.08 3 Fy | 
1 —02 —06 —1 —1 — Z 
O07 00 —08 —2 —.2 rif 
01 —.05 —07 —.08 —.08 —.06 .00 5 
01 —.06 —.08 —.09 —.09 —.07 —.02 4 
02 —06 —09 —1 —1 —1 3 8 
—07 —1 —2 —2 BY? 
—.02 —1 —% —3 oll 
—.06 —.09 —.09 —.09 —.07 .00 m3) 
01 —06 —09 —1 —1 —.09 4 
—07 —1 —1 —1 —1 oO 85 
—09 —1 —2 — 2 iz 
—1 —2 — 2 sik 
—09 —1 —1 —1 —07 —.02 5 
—08 —1 —1 —1 —.1 4 
—07 —1 —1 —2 —1 oO 9 
—02 —1 —2 —2 — 2 2 
—i «2 «3 a 






































*$\epsilon = r_t - r_\phi$, hence $r_t = r_\phi + \epsilon$.
$$ r_\phi = (.215)(1.263)(1.269) = .345 . $$

This checks well with the $r_\phi$ of .344 obtained above for the same data
through first-order use of the tetrachoric series.

By use of Table 4 and interpolation, $\epsilon = -.01 = r_t - r_\phi$. Hence,
$r_t = .344 - .01 = .334$. This checks well with the $r_t$ of .335 obtained
above through iterative use of the tetrachoric infinite series.
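The arithmetic of this example can be checked with a short Python sketch. Formula (4) is not reproduced in this section, so the code assumes it rescales phi by $\sqrt{pq}/y$ for each variable, which is what the factors 1.263 and 1.269 suggest. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def phi_coefficient(a, b, c, d):
    """Standard phi for a fourfold table of frequencies."""
    return (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d))


def r_phi_from_phi(phi, p, p_prime):
    """Rescale phi by sqrt(pq)/y for each variable (assumed form of formula (4))."""
    nd = NormalDist()
    factor = 1.0
    for prop in (p, p_prime):
        y = nd.pdf(nd.inv_cdf(prop))   # normal ordinate at the point of division
        factor *= sqrt(prop * (1 - prop)) / y
    return phi * factor


phi = phi_coefficient(374, 167, 186, 203)
r_phi = r_phi_from_phi(phi, 541 / 930, 560 / 930)
print(round(phi, 3), round(r_phi, 3))
```

Under this assumption the rescaled phi is algebraically the same quantity as formula (6), so any small difference from the hand result of .345 reflects rounding of the intermediate factors.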
BOOK REVIEWS


JOHN D. TRIMMER. Response of Physical Systems. New York: John Wiley and 
Sons, Inc., 1950, pp. ix + 268. 


Although this book is written primarily for workers in engineering and 
applied physics, the author hopes that it may be of use to workers in other sci- 
ences. It is only in the first chapter that application is made to any extent to 
problems other than those of applied physics. But the point of view which is de- 
veloped by the author, together with the mathematical results presented, would 
make it worthwhile for the biologist or psychologist who is interested in formu- 
lating his own problems in a quantitative, theoretical manner, to make use of this 
book. The text is written in such a way that only a limited knowledge of general 
physics and of differential equations is required in order to follow the principal 
ideas developed. 

In the first chapter, “A Pattern for Systems,” the problem to be treated is 
outlined. The author considers a “system” (e.g., apparatus, organism, mind, so- 
ciety) upon which there is a “forcing” (force, drive, motive, stimulus) which acts 
according to some “law” (equation, habit) to produce a “response” (action, out- 
put, mood, thought). The types of problems one may wish to solve can then be 
classified by the unknown. Thus, the direct problem is to find the response from 
the known forcing, system, and law. The converse problem is to find the forcing, 
the inverse problem is to find the properties of the system, and the inductive prob- 
lem is to find the law. In this chapter and elsewhere the difficulties of isolating 
the system and its parts are brought out. In the second chapter, “Physical Sys- 
tems,” is given a general discussion and a classification of systems together with 
definitions of terms. 

In chapters three, four, and six (First-Order, Second-Order, and Higher-Or- 
der Systems) a number of problems are solved. Among these are thermal, elec- 
trical, and hydraulic models, automatic speed control, production of radio-isotopes, 
and the nuclear chain reactor. The stability of a response and transient responses 
are also discussed in detail. 

Chapter five, “Sinusoidal Forcing of Linear Systems,” gives the solution to
a second-order system with a sinusoidal input. 

Chapter seven, “Measuring Instruments,” treats instruments as physical 
systems. Such problems as range, efficiency, and accuracy are discussed. Consid- 
erable attention is given to errors—mechanism, scale, environmental, dynamic, 
reading, measurement, determinate, sampling, and random errors. 

In chapter eight, “Feedback Systems,” thermal and electronic examples are 
used as illustrations of direct feedback, e.g., the response itself can directly 
change the magnitude of the forcing. An example of negative impedance is also 
given. In chapter nine, “Parametric Forcing,” the system is considered in which 
the response is altered by a change in the value of some parameter. Examples 
used are the condenser microphone and the nuclear chain reactor. If the response 
is able to alter the value of some parameter, the result is a parametric feedback 
system. 
In chapter ten, “Distributed Systems,” problems are treated in which the
values of one or more of the properties of the system are distributed continuously 
in space in such a manner that this fact cannot be ignored. This leads to partial 
differential equations which are briefly discussed. Applications are made to wave 
motion and to the cable problem. 

The last chapter, “Nonlinear Systems,” treats principally the problem of a 
charged particle in a force field and the problem of the oscillator. 

In addition to an appendix of problems, there is an appendix on the use of 
the Laplace transform in the study of linear systems. 


The University of Chicago H. D. Landahl 


PALMER O. JOHNSON. Statistical Methods in Research. New York: Prentice-Hall, 
Inc., 1949, pp. xvii + 377. 


The reviewer wishes to congratulate Professor Johnson on the production of 
a statistical text that sets a new benchmark of merit in the field of statistical 
psychology. The author aimed to write a non-mathematical but essentially ac- 
curate presentation based on the contributions of such statisticians as R. A. 
Fisher, E. S. Pearson, J. Neyman, S. S. Wilks, H. Hotelling, etc. The large set 
of references to original articles bears witness to the diligence with which he 
has culled the statistical field for theories and techniques which would be of use 
to the research worker in psychology or education. How well he has succeeded 
can best be judged by examining the topics covered. 

The book opens with a short discussion on the realm of statistics in which 
the author emphasizes that the student must be taught (1) how to choose the 
most effective statistical tool for the purpose in mind (2) the basic assumptions 
underlying the statistical tool selected (3) how to test whether these basic as- 
sumptions are fulfilled by the particular situation to which the tool is applied. 
The author adheres to this viewpoint rigorously in the greater part of the book; 
for example, all analysis of variance examples give as their first step, the testing 
of the assumption of homogeneity of variance. 

A short discussion of ‘Probability and Likelihood’ which follows contains a 
brief account of Bayes Theorem and maximum likelihood. Maximum-likelihood 
estimates are used with great frequency in the latter parts of the book.

The next chapter, on ‘Sampling Distributions,’ gives us the sampling distributions
of such frequently used statistics as the mean; the variance; Student’s t;
the correlation coefficient; Fisher’s z ratio; the binomial distribution; etc. This
chapter characterizes the approach of the author to advanced problems in mathe- 
matical statistics. Exact formulas are given, but not derived if unfamiliar mathe- 
matical concepts like Gamma functions are involved, or mathematics beyond dif- 
ferential calculus is necessary. All of these sampling distributions are used in 
a later chapter which summarizes the simple useful tests of significance. These 
chapters include some relatively unknown tests such as the $L_1$ test for the homogeneity
of variances and Hotelling’s test for the equality of two correlation coefficients
from the same sample. It is of interest to note that Hotelling’s test can
be generalized to test for the equality of correlation coefficients in any column 
of a correlation matrix. 

A short explanation of the Neyman-Pearson theory of testing statistical 
hypotheses precedes the chapter giving examples of tests of significance. The 
modern statistical theory of inference is competently summarized in terms of the
concepts of the null hypothesis, Type I and Type II errors, critical regions, and 
the power of the likelihood ratio. In accordance with the author’s principles, the 
presentation is on a verbal level, relying on logical formulation rather than rig- 
orous mathematical derivations. Illustrative examples and empirical demonstra- 
tions are given at many points in the exposition. The mathematical symbols are 
used mainly for clear, concise, and correct formulations of these concepts. 

The next chapter deals with the techniques of estimating population para- 
meters by means of statistics. The maximum-likelihood method is used for point 
estimation; both Fisher’s fiducial method and the Neyman-Pearson method of 
confidence belts are used for interval estimation. Maximum-likelihood methods 
are applied to such novel problems as Jackson’s estimation of the reliability of 
a test, and Wilks’ test for the equivalence of two forms of a test with respect to 
means, variances, and covariances. The treatment of reliability follows Jackson 
and Hoyt in using the analysis of variance assumptions of additivity of compo- 
nents and equal error variances. To the reviewer, the analysis of variance prob- 
ability model seems distinctly inferior to the regression probability model for the 
estimation of true and error variance. Also, the assumption of homogeneity of 
the observed variances is unnecessarily restrictive. 

The next section, on testing the hypothesis of normality, includes some use- 
ful advice on how to normalize the set of observations in those cases where the 
distributions can be shown to be certain non-normal types. Then there is a unique 
chapter on those tests of significance which are ‘distribution-free’ or ‘nonpara- 
metric,’ i.e., which make no assumptions about the distribution of the observa- 
tions in the parent population. For most psychologists, the analysis of variance 
using ranked data, Kendall’s W coefficient of concordance, and the use of paired 
comparisons for calculating the agreement and consistency of subjects, will be 
new techniques. This chapter is a good idea and the reviewer hopes it will be ex- 
panded in the next edition to take in Kendall’s rank-order coefficient tau, simple 
devices like Tchebycheff’s inequality, and G. W. Brown’s analysis of variance 
using medians instead of means. 

A discussion of sampling is followed by a detailed mathematical presentation 
of the analysis of variance. The probability models for single classification and 
double classification (rows by columns) are set up, and the maximum-likelihood 
derivation is given. (Incidentally, neither here nor elsewhere is it explicitly 
pointed out that the maximum-likelihood method is equivalent to the least-squares 
technique only because of the assumption of normality.) A number of examples 
of the analysis of variance and covariance are presented. 

Two chapters are devoted to the principles and applications of experimental 
design with examples of experiments in psychology using randomized blocks, 
Latin squares, and factorial designs. It is gratifying to note that the probability 
model for each example is given in full for all applications in analysis of vari- 
ance and covariance. This method of teaching Fisher’s techniques far surpasses 
Fisher’s own presentation of his methods. By explicitly designating which para- 
meters are being tested for their deviation from zero, it allows the student to do 
his own reasoning instead of forcing him to follow some blind, rule-of-thumb 
handbook. To leave out the probability models, as most textbooks do, is to ensure 
that a certain proportion of students will go thru the ritual of analysis of variance
with the religious conscientiousness of a devotee of number magic, and for
the same mystical reasons. 

The book concludes with a succinct summary of multiple regression prob- 
lems. Unfortunately Fisher’s computational technique is used, which means that 
the complete inverse of the correlation matrix is computed each time. Even 
worse, it is not stated explicitly anywhere that it is the inverse of a matrix which 
is being computed. Tests of significance for the multiple correlation and the multiple
regression weights are given and illustrated. The two-group linear discriminant
function is derived using Mahalanobis’s $D^2$. The relation of $D^2$ to Hotelling’s
$T^2$ is mentioned but not explained in detail. Wherry’s proof that Fisher’s linear
discriminant function is exactly proportional to the regression of the dichotomous
criterion of group membership on the independent variables is not mentioned.
Consequently, it is not pointed out that testing $D^2$ or $T^2$ for significance is algebraically
identical to testing a multiple correlation for significance.

The book is designed for a year’s course of advanced statistics, and assumes 
that the student will have a knowledge of descriptive statistics and elementary 
calculus. The reviewer has been able to use the chapters on the testing of sta- 
tistical hypotheses as the basis of the third term in a year’s course in statistics. 
The students were clinical psychologists and for the most part knew no calculus. 
Yet, with Johnson’s treatment, it was possible to give a firm rational foundation 
to all tests of significance by using such concepts as maximum likelihood, null 
hypothesis, Type I and Type II errors, critical region, etc. 

However, the instructor will probably find the book of more use than will 
the student. Johnson has undertaken the arduous task of listing every reference 
that he has used. We thus have a compilation of almost all those articles which 
would be of interest to a statistical psychologist both from a theoretical and 
applied point of view. 

There are several points on which the reviewer finds himself in disagree- 
ment with the author. The difference between a one-tail and a two-tail test is 
never made explicit and apparent errors occur in the applications. For example 
in Problem V.1 (pp. 69-70) the question is asked ‘if the mean ability of the class 
is the same as that of the population.’ This question seems to call for a two-tail 
test since the direction of the difference is not specified. But a one-tail test is
used. Johnson has stated (in a personal communication to the reviewer) that
the one-tail test must be used because the population mean is known in this problem.
This does not seem to be adequate. If the one-tail test were used on each
such case at, let us say, the 5% level, then, when the null hypothesis is true, 10% 
of the differences would be rejected as non-null differences. It seems to the re- 
viewer that the Neyman-Pearson theory demands that the level of confidence 
correspond exactly to the per cent of Type I errors. 
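The 10% figure can be checked numerically. The sketch below assumes a normally distributed test statistic and assumes that the tail is chosen after the direction of the observed difference is seen, which appears to be the situation the reviewer has in mind. The variable names are illustrative.

```python
from statistics import NormalDist

nd = NormalDist()
alpha = 0.05
z_crit = nd.inv_cdf(1 - alpha)   # one-tail critical value at the 5% level

# Choosing the tail after seeing the sign of the difference rejects whenever
# |z| exceeds z_crit, so both tails contribute to the Type I error rate.
type_i_rate = 2 * (1 - nd.cdf(z_crit))
print(round(z_crit, 3), round(type_i_rate, 3))
```

Under the null hypothesis the rejection rate is therefore twice the nominal one-tail level.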

A similar case arises when the F ratio is used to test the homogeneity of 
two variances (p. 82). The tabled value of the F ratio gives a one-tail test. But
since the direction of the difference is not specified, a two-tail test is necessary.
We must find the probability that an F ratio would exceed $s_1^2/s_2^2$ or be less than
$s_2^2/s_1^2$ (where $s_1^2 > s_2^2$). The $L_1$ test for homogeneity of the two variances
gives this same two-tail F test. A. M. Mood demonstrates the point very
clearly in his recent text (Introduction to the Theory of Statistics. New York: 
McGraw-Hill Book Co., 1950, p. 268). 

The treatment of chi-square is utterly inadequate. This is all the more surprising
in view of the care taken to define other basic sampling distributions,
such as t and Fisher’s z. The first mention of chi-square comes on page 37 in 
the middle of a discussion on the goodness of fit of sampled means to a known 
normal distribution. The inquiring student is referred to page 96 where he will 
find an example of the use of chi-square, but not a word of its meaning, or the 
basic sampling distribution it illustrates. But as Lewis and Burke have pointed 
out, statistical textbooks for psychologists seem to universally fail when the chi- 
square test is discussed (D. Lewis and C. V. Burke, Psychol. Bull., 1949, 46, 433- 
489). Their article, however, should go a long way towards rectifying this de- 
ficiency. 

In spite of these defects, the book remains unexcelled in its field. Author 
Johnson has managed to pack more good statistics into his book than appears in 
any other comparable text. Statistical Methods in Research is definitely required 
reading for all teachers of statistics in the fields of psychology and education. 


University College, London Ardie Lubin 


HOWARD W. GOHEEN and SAMUEL KAVRUCK, Selected References on Test Con-
struction, Mental Test Theory, and Statistics, 1929-1949, Washington, D. C.: 
United States Civil Service Commission, 1950, pp. 209. 


This is an exceedingly valuable index of much of the literature in the fields 
of test theory, test construction, and statistics. Recommended for all members of 
the Society. 


University of Michigan Clyde H. Coombs 





















































